分类目录归档:知识问答

OpenCV-Python中的简单数字识别OCR

问题:OpenCV-Python中的简单数字识别OCR

我正在尝试在OpenCV-Python(cv2)中实现“数字识别OCR”。它仅用于学习目的。我想学习OpenCV中的KNearest和SVM功能。

我每个数字有100个样本(即图像)。我想和他们一起训练。

letter_recog.pyOpenCV示例附带一个示例。但是我仍然不知道如何使用它。我不了解样本,响应等内容。此外,它首先会加载txt文件,而我首先并不了解。

稍后进行搜索时,我可以在cpp样本中找到letter_recognitiontion.data。我用它并在letter_recog.py模型中为cv2.KNearest编写了一个代码(仅用于测试):

import numpy as np
import cv2

fn = 'letter-recognition.data'
a = np.loadtxt(fn, np.float32, delimiter=',', converters={ 0 : lambda ch : ord(ch)-ord('A') })
samples, responses = a[:,1:], a[:,0]

model = cv2.KNearest()
retval = model.train(samples,responses)
retval, results, neigh_resp, dists = model.find_nearest(samples, k = 10)
print results.ravel()

它给了我一个大小为20000的数组,我不知道它是什么。

问题:

1)什么是letter_recognition.data文件?如何从我自己的数据集中构建该文件?

2)results.reval()代表什么?

3)我们如何使用letter_recognition.data文件(KNearest或SVM)编写一个简单的数字识别工具?

I am trying to implement a “Digit Recognition OCR” in OpenCV-Python (cv2). It is just for learning purposes. I would like to learn both KNearest and SVM features in OpenCV.

I have 100 samples (i.e. images) of each digit. I would like to train with them.

There is a sample letter_recog.py that comes with OpenCV sample. But I still couldn’t figure out on how to use it. I don’t understand what are the samples, responses etc. Also, it loads a txt file at first, which I didn’t understand first.

Later on searching a little bit, I could find a letter_recognition.data in cpp samples. I used it and made a code for cv2.KNearest in the model of letter_recog.py (just for testing):

import numpy as np
import cv2

fn = 'letter-recognition.data'
a = np.loadtxt(fn, np.float32, delimiter=',', converters={ 0 : lambda ch : ord(ch)-ord('A') })
samples, responses = a[:,1:], a[:,0]

model = cv2.KNearest()
retval = model.train(samples,responses)
retval, results, neigh_resp, dists = model.find_nearest(samples, k = 10)
print results.ravel()

It gave me an array of size 20000, I don’t understand what it is.

Questions:

1) What is letter_recognition.data file? How to build that file from my own data set?

2) What does results.reval() denote?

3) How we can write a simple digit recognition tool using letter_recognition.data file (either KNearest or SVM)?


回答 0

好吧,我决定对我的问题进行锻炼以解决上述问题。我想要的是使用OpenCV中的KNearest或SVM功能实现简单的OCR。以下是我的工作方式。(这只是为了学习如何将KNearest用于简单的OCR目的)。

1)我的第一个问题是有关OpenCV示例随附的letter_recognition.data文件的。我想知道该文件中的内容。

它包含一个字母以及该字母的16个功能。

this SOF帮助我找到了它。本文介绍了这16个功能Letter Recognition Using Holland-Style Adaptive Classifiers。(尽管我不了解最后的一些功能)

2)由于我知道,如果不了解所有这些功能,就很难做到这一点。我尝试了其他一些论文,但是对于初学者来说,都有些困难。

So I just decided to take all the pixel values as my features. (我并不担心准确性或性能,我只是希望它能够工作,至少以最低的准确性)

我为训练数据拍摄了下图:

在此处输入图片说明

(我知道训练数据的数量较少。但是,由于所有字母的字体和大小都相同,因此我决定尝试一下)。

为了准备训练数据,我在OpenCV中编写了一个小代码。它执行以下操作:

  1. 它加载图像。
  2. 选择数字(显然是通过轮廓查找并在字母的面积和高度上施加约束来避免错误检测)。
  3. 围绕一个字母绘制边界矩形,然后等待key press manually。这次我们自己按对应于框中字母的数字键
  4. 一旦按下相应的数字键,它将将该框的大小调整为10×10,并在一个数组(此处为样本)中保存100个像素值,在另一个数组中(此处为响应)保存相应的手动输入的数字。
  5. 然后将两个数组保存在单独的txt文件中。

手动数字分类结束时,火车数据(train.png)中的所有数字都是由我们自己手动标记的,图像如下所示:

在此处输入图片说明

以下是我用于上述目的的代码(当然,不是很干净):

import sys

import numpy as np
import cv2

im = cv2.imread('pitrain.png')
im3 = im.copy()

gray = cv2.cvtColor(im,cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray,(5,5),0)
thresh = cv2.adaptiveThreshold(blur,255,1,1,11,2)

#################      Now finding Contours         ###################

contours,hierarchy = cv2.findContours(thresh,cv2.RETR_LIST,cv2.CHAIN_APPROX_SIMPLE)

samples =  np.empty((0,100))
responses = []
keys = [i for i in range(48,58)]

for cnt in contours:
    if cv2.contourArea(cnt)>50:
        [x,y,w,h] = cv2.boundingRect(cnt)

        if  h>28:
            cv2.rectangle(im,(x,y),(x+w,y+h),(0,0,255),2)
            roi = thresh[y:y+h,x:x+w]
            roismall = cv2.resize(roi,(10,10))
            cv2.imshow('norm',im)
            key = cv2.waitKey(0)

            if key == 27:  # (escape to quit)
                sys.exit()
            elif key in keys:
                responses.append(int(chr(key)))
                sample = roismall.reshape((1,100))
                samples = np.append(samples,sample,0)

responses = np.array(responses,np.float32)
responses = responses.reshape((responses.size,1))
print "training complete"

np.savetxt('generalsamples.data',samples)
np.savetxt('generalresponses.data',responses)

现在我们进入培训和测试部分。

为了测试零件,我使用了下面的图片,该图片与我训练过的字母具有相同的类型。

在此处输入图片说明

对于培训,我们执行以下操作

  1. 加载我们之前已经保存的txt文件
  2. 创建一个我们正在使用的分类器的实例(这里是KNearest)
  3. 然后我们使用KNearest.train函数来训练数据

出于测试目的,我们执行以下操作:

  1. 我们加载用于测试的图像
  2. 较早处理图像并使用轮廓法提取每个数字
  3. 为其绘制一个边界框,然后将其大小调整为10×10,并将其像素值存储在数组中,如之前所做的那样。
  4. 然后,我们使用KNearest.find_nearest()函数查找与我们给出的项目最接近的项目。(如果幸运,它将识别出正确的数字。)

我在下面的单个代码中包括了最后两个步骤(培训和测试):

import cv2
import numpy as np

#######   training part    ############### 
samples = np.loadtxt('generalsamples.data',np.float32)
responses = np.loadtxt('generalresponses.data',np.float32)
responses = responses.reshape((responses.size,1))

model = cv2.KNearest()
model.train(samples,responses)

############################# testing part  #########################

im = cv2.imread('pi.png')
out = np.zeros(im.shape,np.uint8)
gray = cv2.cvtColor(im,cv2.COLOR_BGR2GRAY)
thresh = cv2.adaptiveThreshold(gray,255,1,1,11,2)

contours,hierarchy = cv2.findContours(thresh,cv2.RETR_LIST,cv2.CHAIN_APPROX_SIMPLE)

for cnt in contours:
    if cv2.contourArea(cnt)>50:
        [x,y,w,h] = cv2.boundingRect(cnt)
        if  h>28:
            cv2.rectangle(im,(x,y),(x+w,y+h),(0,255,0),2)
            roi = thresh[y:y+h,x:x+w]
            roismall = cv2.resize(roi,(10,10))
            roismall = roismall.reshape((1,100))
            roismall = np.float32(roismall)
            retval, results, neigh_resp, dists = model.find_nearest(roismall, k = 1)
            string = str(int((results[0][0])))
            cv2.putText(out,string,(x,y+h),0,1,(0,255,0))

cv2.imshow('im',im)
cv2.imshow('out',out)
cv2.waitKey(0)

它奏效了,下面是我得到的结果:

在此处输入图片说明


在这里,它以100%的精度工作。我认为这是因为所有数字都是相同的种类和大小。

但是无论如何,这对于初学者来说是一个不错的开始(我希望如此)。

Well, I decided to workout myself on my question to solve above problem. What I wanted is to implement a simpl OCR using KNearest or SVM features in OpenCV. And below is what I did and how. ( it is just for learning how to use KNearest for simple OCR purposes).

1) My first question was about letter_recognition.data file that comes with OpenCV samples. I wanted to know what is inside that file.

It contains a letter, along with 16 features of that letter.

And this SOF helped me to find it. These 16 features are explained in the paperLetter Recognition Using Holland-Style Adaptive Classifiers. ( Although I didn’t understand some of the features at end)

2) Since I knew, without understanding all those features, it is difficult to do that method. I tried some other papers, but all were a little difficult for a beginner.

So I just decided to take all the pixel values as my features. (I was not worried about accuracy or performance, I just wanted it to work, at least with the least accuracy)

I took below image for my training data:

enter image description here

( I know the amount of training data is less. But, since all letters are of same font and size, I decided to try on this).

To prepare the data for training, I made a small code in OpenCV. It does following things:

  1. It loads the image.
  2. Selects the digits ( obviously by contour finding and applying constraints on area and height of letters to avoid false detections).
  3. Draws the bounding rectangle around one letter and wait for key press manually. This time we press the digit key ourselves corresponding to the letter in box.
  4. Once corresponding digit key is pressed, it resizes this box to 10×10 and saves 100 pixel values in an array (here, samples) and corresponding manually entered digit in another array(here, responses).
  5. Then save both the arrays in separate txt files.

At the end of manual classification of digits, all the digits in the train data( train.png) are labeled manually by ourselves, image will look like below:

enter image description here

Below is the code I used for above purpose ( of course, not so clean):

import sys

import numpy as np
import cv2

im = cv2.imread('pitrain.png')
im3 = im.copy()

gray = cv2.cvtColor(im,cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray,(5,5),0)
thresh = cv2.adaptiveThreshold(blur,255,1,1,11,2)

#################      Now finding Contours         ###################

contours,hierarchy = cv2.findContours(thresh,cv2.RETR_LIST,cv2.CHAIN_APPROX_SIMPLE)

samples =  np.empty((0,100))
responses = []
keys = [i for i in range(48,58)]

for cnt in contours:
    if cv2.contourArea(cnt)>50:
        [x,y,w,h] = cv2.boundingRect(cnt)

        if  h>28:
            cv2.rectangle(im,(x,y),(x+w,y+h),(0,0,255),2)
            roi = thresh[y:y+h,x:x+w]
            roismall = cv2.resize(roi,(10,10))
            cv2.imshow('norm',im)
            key = cv2.waitKey(0)

            if key == 27:  # (escape to quit)
                sys.exit()
            elif key in keys:
                responses.append(int(chr(key)))
                sample = roismall.reshape((1,100))
                samples = np.append(samples,sample,0)

responses = np.array(responses,np.float32)
responses = responses.reshape((responses.size,1))
print "training complete"

np.savetxt('generalsamples.data',samples)
np.savetxt('generalresponses.data',responses)

Now we enter in to training and testing part.

For testing part I used below image, which has same type of letters I used to train.

enter image description here

For training we do as follows:

  1. Load the txt files we already saved earlier
  2. create a instance of classifier we are using ( here, it is KNearest)
  3. Then we use KNearest.train function to train the data

For testing purposes, we do as follows:

  1. We load the image used for testing
  2. process the image as earlier and extract each digit using contour methods
  3. Draw bounding box for it, then resize to 10×10, and store its pixel values in an array as done earlier.
  4. Then we use KNearest.find_nearest() function to find the nearest item to the one we gave. ( If lucky, it recognises the correct digit.)

I included last two steps ( training and testing) in single code below:

import cv2
import numpy as np

#######   training part    ############### 
samples = np.loadtxt('generalsamples.data',np.float32)
responses = np.loadtxt('generalresponses.data',np.float32)
responses = responses.reshape((responses.size,1))

model = cv2.KNearest()
model.train(samples,responses)

############################# testing part  #########################

im = cv2.imread('pi.png')
out = np.zeros(im.shape,np.uint8)
gray = cv2.cvtColor(im,cv2.COLOR_BGR2GRAY)
thresh = cv2.adaptiveThreshold(gray,255,1,1,11,2)

contours,hierarchy = cv2.findContours(thresh,cv2.RETR_LIST,cv2.CHAIN_APPROX_SIMPLE)

for cnt in contours:
    if cv2.contourArea(cnt)>50:
        [x,y,w,h] = cv2.boundingRect(cnt)
        if  h>28:
            cv2.rectangle(im,(x,y),(x+w,y+h),(0,255,0),2)
            roi = thresh[y:y+h,x:x+w]
            roismall = cv2.resize(roi,(10,10))
            roismall = roismall.reshape((1,100))
            roismall = np.float32(roismall)
            retval, results, neigh_resp, dists = model.find_nearest(roismall, k = 1)
            string = str(int((results[0][0])))
            cv2.putText(out,string,(x,y+h),0,1,(0,255,0))

cv2.imshow('im',im)
cv2.imshow('out',out)
cv2.waitKey(0)

And it worked, below is the result I got:

enter image description here


Here it worked with 100% accuracy. I assume this is because all the digits are of same kind and same size.

But any way, this is a good start to go for beginners ( I hope so).


回答 1

对于那些对C ++代码感兴趣的人,可以参考以下代码。感谢Abid Rahman的出色解释。


步骤与上面相同,但是轮廓查找仅使用第一层次级别的轮廓,因此算法仅对每个数字使用外部轮廓。

用于创建样本和标签数据的代码

//Process image to extract contour
Mat thr,gray,con;
Mat src=imread("digit.png",1);
cvtColor(src,gray,CV_BGR2GRAY);
threshold(gray,thr,200,255,THRESH_BINARY_INV); //Threshold to find contour
thr.copyTo(con);

// Create sample and label data
vector< vector <Point> > contours; // Vector for storing contour
vector< Vec4i > hierarchy;
Mat sample;
Mat response_array;  
findContours( con, contours, hierarchy,CV_RETR_CCOMP, CV_CHAIN_APPROX_SIMPLE ); //Find contour

for( int i = 0; i< contours.size(); i=hierarchy[i][0] ) // iterate through first hierarchy level contours
{
    Rect r= boundingRect(contours[i]); //Find bounding rect for each contour
    rectangle(src,Point(r.x,r.y), Point(r.x+r.width,r.y+r.height), Scalar(0,0,255),2,8,0);
    Mat ROI = thr(r); //Crop the image
    Mat tmp1, tmp2;
    resize(ROI,tmp1, Size(10,10), 0,0,INTER_LINEAR ); //resize to 10X10
    tmp1.convertTo(tmp2,CV_32FC1); //convert to float
    sample.push_back(tmp2.reshape(1,1)); // Store  sample data
    imshow("src",src);
    int c=waitKey(0); // Read corresponding label for contour from keyoard
    c-=0x30;     // Convert ascii to intiger value
    response_array.push_back(c); // Store label to a mat
    rectangle(src,Point(r.x,r.y), Point(r.x+r.width,r.y+r.height), Scalar(0,255,0),2,8,0);    
}

// Store the data to file
Mat response,tmp;
tmp=response_array.reshape(1,1); //make continuous
tmp.convertTo(response,CV_32FC1); // Convert  to float

FileStorage Data("TrainingData.yml",FileStorage::WRITE); // Store the sample data in a file
Data << "data" << sample;
Data.release();

FileStorage Label("LabelData.yml",FileStorage::WRITE); // Store the label data in a file
Label << "label" << response;
Label.release();
cout<<"Training and Label data created successfully....!! "<<endl;

imshow("src",src);
waitKey();

培训和测试代码

Mat thr,gray,con;
Mat src=imread("dig.png",1);
cvtColor(src,gray,CV_BGR2GRAY);
threshold(gray,thr,200,255,THRESH_BINARY_INV); // Threshold to create input
thr.copyTo(con);


// Read stored sample and label for training
Mat sample;
Mat response,tmp;
FileStorage Data("TrainingData.yml",FileStorage::READ); // Read traing data to a Mat
Data["data"] >> sample;
Data.release();

FileStorage Label("LabelData.yml",FileStorage::READ); // Read label data to a Mat
Label["label"] >> response;
Label.release();


KNearest knn;
knn.train(sample,response); // Train with sample and responses
cout<<"Training compleated.....!!"<<endl;

vector< vector <Point> > contours; // Vector for storing contour
vector< Vec4i > hierarchy;

//Create input sample by contour finding and cropping
findContours( con, contours, hierarchy,CV_RETR_CCOMP, CV_CHAIN_APPROX_SIMPLE );
Mat dst(src.rows,src.cols,CV_8UC3,Scalar::all(0));

for( int i = 0; i< contours.size(); i=hierarchy[i][0] ) // iterate through each contour for first hierarchy level .
{
    Rect r= boundingRect(contours[i]);
    Mat ROI = thr(r);
    Mat tmp1, tmp2;
    resize(ROI,tmp1, Size(10,10), 0,0,INTER_LINEAR );
    tmp1.convertTo(tmp2,CV_32FC1);
    float p=knn.find_nearest(tmp2.reshape(1,1), 1);
    char name[4];
    sprintf(name,"%d",(int)p);
    putText( dst,name,Point(r.x,r.y+r.height) ,0,1, Scalar(0, 255, 0), 2, 8 );
}

imshow("src",src);
imshow("dst",dst);
imwrite("dest.jpg",dst);
waitKey();

结果

结果,第一行中的点被检测为8,而我们尚未训练该点。另外,我正在考虑将第一个层次结构中的每个轮廓作为样本输入,用户可以通过计算面积来避免它。

结果

For those who interested in C++ code can refer below code. Thanks Abid Rahman for the nice explanation.


The procedure is same as above but, the contour finding uses only first hierarchy level contour, so that the algorithm uses only outer contour for each digit.

Code for creating sample and Label data

//Process image to extract contour
Mat thr,gray,con;
Mat src=imread("digit.png",1);
cvtColor(src,gray,CV_BGR2GRAY);
threshold(gray,thr,200,255,THRESH_BINARY_INV); //Threshold to find contour
thr.copyTo(con);

// Create sample and label data
vector< vector <Point> > contours; // Vector for storing contour
vector< Vec4i > hierarchy;
Mat sample;
Mat response_array;  
findContours( con, contours, hierarchy,CV_RETR_CCOMP, CV_CHAIN_APPROX_SIMPLE ); //Find contour

for( int i = 0; i< contours.size(); i=hierarchy[i][0] ) // iterate through first hierarchy level contours
{
    Rect r= boundingRect(contours[i]); //Find bounding rect for each contour
    rectangle(src,Point(r.x,r.y), Point(r.x+r.width,r.y+r.height), Scalar(0,0,255),2,8,0);
    Mat ROI = thr(r); //Crop the image
    Mat tmp1, tmp2;
    resize(ROI,tmp1, Size(10,10), 0,0,INTER_LINEAR ); //resize to 10X10
    tmp1.convertTo(tmp2,CV_32FC1); //convert to float
    sample.push_back(tmp2.reshape(1,1)); // Store  sample data
    imshow("src",src);
    int c=waitKey(0); // Read corresponding label for contour from keyoard
    c-=0x30;     // Convert ascii to intiger value
    response_array.push_back(c); // Store label to a mat
    rectangle(src,Point(r.x,r.y), Point(r.x+r.width,r.y+r.height), Scalar(0,255,0),2,8,0);    
}

// Store the data to file
Mat response,tmp;
tmp=response_array.reshape(1,1); //make continuous
tmp.convertTo(response,CV_32FC1); // Convert  to float

FileStorage Data("TrainingData.yml",FileStorage::WRITE); // Store the sample data in a file
Data << "data" << sample;
Data.release();

FileStorage Label("LabelData.yml",FileStorage::WRITE); // Store the label data in a file
Label << "label" << response;
Label.release();
cout<<"Training and Label data created successfully....!! "<<endl;

imshow("src",src);
waitKey();

Code for training and testing

Mat thr,gray,con;
Mat src=imread("dig.png",1);
cvtColor(src,gray,CV_BGR2GRAY);
threshold(gray,thr,200,255,THRESH_BINARY_INV); // Threshold to create input
thr.copyTo(con);


// Read stored sample and label for training
Mat sample;
Mat response,tmp;
FileStorage Data("TrainingData.yml",FileStorage::READ); // Read traing data to a Mat
Data["data"] >> sample;
Data.release();

FileStorage Label("LabelData.yml",FileStorage::READ); // Read label data to a Mat
Label["label"] >> response;
Label.release();


KNearest knn;
knn.train(sample,response); // Train with sample and responses
cout<<"Training compleated.....!!"<<endl;

vector< vector <Point> > contours; // Vector for storing contour
vector< Vec4i > hierarchy;

//Create input sample by contour finding and cropping
findContours( con, contours, hierarchy,CV_RETR_CCOMP, CV_CHAIN_APPROX_SIMPLE );
Mat dst(src.rows,src.cols,CV_8UC3,Scalar::all(0));

for( int i = 0; i< contours.size(); i=hierarchy[i][0] ) // iterate through each contour for first hierarchy level .
{
    Rect r= boundingRect(contours[i]);
    Mat ROI = thr(r);
    Mat tmp1, tmp2;
    resize(ROI,tmp1, Size(10,10), 0,0,INTER_LINEAR );
    tmp1.convertTo(tmp2,CV_32FC1);
    float p=knn.find_nearest(tmp2.reshape(1,1), 1);
    char name[4];
    sprintf(name,"%d",(int)p);
    putText( dst,name,Point(r.x,r.y+r.height) ,0,1, Scalar(0, 255, 0), 2, 8 );
}

imshow("src",src);
imshow("dst",dst);
imwrite("dest.jpg",dst);
waitKey();

Result

In the result the dot in the first line is detected as 8 and we haven’t trained for dot. Also I am considering every contour in first hierarchy level as the sample input, user can avoid it by computing the area.

Results


回答 2

如果您对机器学习的最新技术感兴趣,则应研究深度学习。您应该具有支持GPU的CUDA,或者在Amazon Web Services上使用GPU。

Google Udacity使用Tensor Flow对此提供了很好的教程。本教程将教您如何在手写数字上训练自己的分类器。使用卷积网络,我在测试集上的准确性超过97%。

If you are interested in the state of the art in Machine Learning, you should look into Deep Learning. You should have a CUDA supporting GPU or alternatively use the GPU on Amazon Web Services.

Google Udacity has a nice tutorial on this using Tensor Flow. This tutorial will teach you how to train your own classifier on hand written digits. I got an accuracy of over 97% on the test set using Convolutional Networks.


如何比较两个日期?

问题:如何比较两个日期?

如何使用Python比较两个日期以查看稍后的日期?

例如,我想检查当前日期是否超过了我在假期中创建的此列表中的最后日期,因此它将自动发送电子邮件,告诉管理员更新holiday.txt文件。

How would I compare two dates to see which is later, using Python?

For example, I want to check if the current date is past the last date in this list I am creating, of holiday dates, so that it will send an email automatically, telling the admin to update the holiday.txt file.


回答 0

使用datetime方法和运算符<及其种类。

>>> from datetime import datetime, timedelta
>>> past = datetime.now() - timedelta(days=1)
>>> present = datetime.now()
>>> past < present
True
>>> datetime(3000, 1, 1) < present
False
>>> present - datetime(2000, 4, 4)
datetime.timedelta(4242, 75703, 762105)

Use the datetime method and the operator < and its kin.

>>> from datetime import datetime, timedelta
>>> past = datetime.now() - timedelta(days=1)
>>> present = datetime.now()
>>> past < present
True
>>> datetime(3000, 1, 1) < present
False
>>> present - datetime(2000, 4, 4)
datetime.timedelta(4242, 75703, 762105)

回答 1

采用 time

假设您的初始日期是像这样的字符串:
date1 = "31/12/2015"
date2 = "01/01/2016"

您可以执行以下操作:
newdate1 = time.strptime(date1, "%d/%m/%Y")并将newdate2 = time.strptime(date2, "%d/%m/%Y")其转换为python的日期格式。然后,比较是显而易见的:

newdate1 > newdate2will return False
newdate1 < newdate2will returnTrue

Use time

Let’s say you have the initial dates as strings like these:
date1 = "31/12/2015"
date2 = "01/01/2016"

You can do the following:
newdate1 = time.strptime(date1, "%d/%m/%Y") and newdate2 = time.strptime(date2, "%d/%m/%Y") to convert them to python’s date format. Then, the comparison is obvious:

newdate1 > newdate2 will return False
newdate1 < newdate2 will return True


回答 2

datetime.date(2011, 1, 1) < datetime.date(2011, 1, 2)会回来的True

datetime.date(2011, 1, 1) - datetime.date(2011, 1, 2)会回来的datetime.timedelta(-1)

datetime.date(2011, 1, 1) + datetime.date(2011, 1, 2)会回来的datetime.timedelta(1)

请参阅文档

datetime.date(2011, 1, 1) < datetime.date(2011, 1, 2) will return True.

datetime.date(2011, 1, 1) - datetime.date(2011, 1, 2) will return datetime.timedelta(-1).

datetime.date(2011, 1, 1) + datetime.date(2011, 1, 2) will return datetime.timedelta(1).

see the docs.


回答 3

使用datetime和比较的其他答案也仅适用于时间,没有日期。

例如,要检查现在是否大于或小于8:00,我们可以使用:

import datetime

eight_am = datetime.time( 8,0,0 ) # Time, without a date

然后与之比较:

datetime.datetime.now().time() > eight_am  

这将返回 True

Other answers using datetime and comparisons also work for time only, without a date.

For example, to check if right now it is more or less than 8:00 a.m., we can use:

import datetime

eight_am = datetime.time( 8,0,0 ) # Time, without a date

And later compare with:

datetime.datetime.now().time() > eight_am  

which will return True


回答 4

要计算两个日期之间的天差,可以按照以下步骤进行:

import datetime
import math

issuedate = datetime(2019,5,9)   #calculate the issue datetime
current_date = datetime.datetime.now() #calculate the current datetime
diff_date = current_date - issuedate #//calculate the date difference with time also
amount = fine  #you want change

if diff_date.total_seconds() > 0.0:   #its matching your condition
    days = math.ceil(diff_date.total_seconds()/86400)  #calculate days (in 
    one day 86400 seconds)
    deductable_amount = round(amount,2)*days #calclulated fine for all days

因为如果在截止日期之前还有一秒钟多的时间,那么我们必须收费

For calculating days in two dates difference, can be done like below:

import datetime
import math

issuedate = datetime(2019,5,9)   #calculate the issue datetime
current_date = datetime.datetime.now() #calculate the current datetime
diff_date = current_date - issuedate #//calculate the date difference with time also
amount = fine  #you want change

if diff_date.total_seconds() > 0.0:   #its matching your condition
    days = math.ceil(diff_date.total_seconds()/86400)  #calculate days (in 
    one day 86400 seconds)
    deductable_amount = round(amount,2)*days #calclulated fine for all days

Becuase if one second is more with the due date then we have to charge


如何比较python中的两个列表并返回匹配项

问题:如何比较python中的两个列表并返回匹配项

我想获取两个列表并查找两个列表中都出现的值。

a = [1, 2, 3, 4, 5]
b = [9, 8, 7, 6, 5]

returnMatches(a, b)

[5]例如,将返回。

I want to take two lists and find the values that appear in both.

a = [1, 2, 3, 4, 5]
b = [9, 8, 7, 6, 5]

returnMatches(a, b)

would return [5], for instance.


回答 0

不是最有效的方法,但到目前为止,最明显的方法是:

>>> a = [1, 2, 3, 4, 5]
>>> b = [9, 8, 7, 6, 5]
>>> set(a) & set(b)
{5}

如果订单很重要,您可以使用以下列表理解方法:

>>> [i for i, j in zip(a, b) if i == j]
[5]

(仅适用于大小相等的列表,这意味着顺序意义)。

Not the most efficient one, but by far the most obvious way to do it is:

>>> a = [1, 2, 3, 4, 5]
>>> b = [9, 8, 7, 6, 5]
>>> set(a) & set(b)
{5}

if order is significant you can do it with list comprehensions like this:

>>> [i for i, j in zip(a, b) if i == j]
[5]

(only works for equal-sized lists, which order-significance implies).


回答 1

使用set.intersection(),它快速且可读。

>>> set(a).intersection(b)
set([5])

Use set.intersection(), it’s fast and readable.

>>> set(a).intersection(b)
set([5])

回答 2

快速的性能测试表明Lutz的解决方案是最好的:

import time

def speed_test(func):
    def wrapper(*args, **kwargs):
        t1 = time.time()
        for x in xrange(5000):
            results = func(*args, **kwargs)
        t2 = time.time()
        print '%s took %0.3f ms' % (func.func_name, (t2-t1)*1000.0)
        return results
    return wrapper

@speed_test
def compare_bitwise(x, y):
    set_x = frozenset(x)
    set_y = frozenset(y)
    return set_x & set_y

@speed_test
def compare_listcomp(x, y):
    return [i for i, j in zip(x, y) if i == j]

@speed_test
def compare_intersect(x, y):
    return frozenset(x).intersection(y)

# Comparing short lists
a = [1, 2, 3, 4, 5]
b = [9, 8, 7, 6, 5]
compare_bitwise(a, b)
compare_listcomp(a, b)
compare_intersect(a, b)

# Comparing longer lists
import random
a = random.sample(xrange(100000), 10000)
b = random.sample(xrange(100000), 10000)
compare_bitwise(a, b)
compare_listcomp(a, b)
compare_intersect(a, b)

这些是我机器上的结果:

# Short list:
compare_bitwise took 10.145 ms
compare_listcomp took 11.157 ms
compare_intersect took 7.461 ms

# Long list:
compare_bitwise took 11203.709 ms
compare_listcomp took 17361.736 ms
compare_intersect took 6833.768 ms

显然,任何人工性能测试都应以一粒盐进行,但由于 set().intersection()答案至少与其他解决方案一样快,并且也是最易读的,因此它应该是解决此常见问题的标准解决方案。

A quick performance test showing Lutz’s solution is the best:

import time

def speed_test(func):
    def wrapper(*args, **kwargs):
        t1 = time.time()
        for x in xrange(5000):
            results = func(*args, **kwargs)
        t2 = time.time()
        print '%s took %0.3f ms' % (func.func_name, (t2-t1)*1000.0)
        return results
    return wrapper

@speed_test
def compare_bitwise(x, y):
    set_x = frozenset(x)
    set_y = frozenset(y)
    return set_x & set_y

@speed_test
def compare_listcomp(x, y):
    return [i for i, j in zip(x, y) if i == j]

@speed_test
def compare_intersect(x, y):
    return frozenset(x).intersection(y)

# Comparing short lists
a = [1, 2, 3, 4, 5]
b = [9, 8, 7, 6, 5]
compare_bitwise(a, b)
compare_listcomp(a, b)
compare_intersect(a, b)

# Comparing longer lists
import random
a = random.sample(xrange(100000), 10000)
b = random.sample(xrange(100000), 10000)
compare_bitwise(a, b)
compare_listcomp(a, b)
compare_intersect(a, b)

These are the results on my machine:

# Short list:
compare_bitwise took 10.145 ms
compare_listcomp took 11.157 ms
compare_intersect took 7.461 ms

# Long list:
compare_bitwise took 11203.709 ms
compare_listcomp took 17361.736 ms
compare_intersect took 6833.768 ms

Obviously, any artificial performance test should be taken with a grain of salt, but since the set().intersection() answer is at least as fast as the other solutions, and also the most readable, it should be the standard solution for this common problem.


回答 3

我更喜欢基于集合的答案,但这还是可行的

[x for x in a if x in b]

I prefer the set based answers, but here’s one that works anyway

[x for x in a if x in b]

回答 4

最简单的方法是使用set

>>> a = [1, 2, 3, 4, 5]
>>> b = [9, 8, 7, 6, 5]
>>> set(a) & set(b)
set([5])

The easiest way to do that is to use sets:

>>> a = [1, 2, 3, 4, 5]
>>> b = [9, 8, 7, 6, 5]
>>> set(a) & set(b)
set([5])

回答 5

快速方式:

list(set(a).intersection(set(b)))

Quick way:

list(set(a).intersection(set(b)))

回答 6

>>> s = ['a','b','c']   
>>> f = ['a','b','d','c']  
>>> ss= set(s)  
>>> fs =set(f)  
>>> print ss.intersection(fs)   
   **set(['a', 'c', 'b'])**  
>>> print ss.union(fs)        
   **set(['a', 'c', 'b', 'd'])**  
>>> print ss.union(fs)  - ss.intersection(fs)   
   **set(['d'])**
>>> s = ['a','b','c']   
>>> f = ['a','b','d','c']  
>>> ss= set(s)  
>>> fs =set(f)  
>>> print ss.intersection(fs)   
   **set(['a', 'c', 'b'])**  
>>> print ss.union(fs)        
   **set(['a', 'c', 'b', 'd'])**  
>>> print ss.union(fs)  - ss.intersection(fs)   
   **set(['d'])**

回答 7

您也可以通过将公共元素保留在新列表中来尝试此操作。

new_list = []
for element in a:
    if element in b:
        new_list.append(element)

Also you can try this,by keeping common elements in a new list.

new_list = []
for element in a:
    if element in b:
        new_list.append(element)

回答 8

您要重复吗?如果不是,您应该使用集合:


>>> set([1, 2, 3, 4, 5]).intersection(set([9, 8, 7, 6, 5]))
set([5])

Do you want duplicates? If not maybe you should use sets instead:


>>> set([1, 2, 3, 4, 5]).intersection(set([9, 8, 7, 6, 5]))
set([5])

回答 9

检查对象的深度为1且保持顺序的列表1(lst1)和列表2(lst2)的列表相等性的另一种实用方式是:

all(i == j for i, j in zip(lst1, lst2))   

another a bit more functional way to check list equality for list 1 (lst1) and list 2 (lst2) where objects have depth one and which keeps the order is:

all(i == j for i, j in zip(lst1, lst2))   

回答 10

a = [1, 2, 3, 4, 5]
b = [9, 8, 7, 6, 5]

lista =set(a)
listb =set(b)   
print listb.intersection(lista)   
returnMatches = set(['5']) #output 

print " ".join(str(return) for return in returnMatches ) # remove the set()   

 5        #final output 
a = [1, 2, 3, 4, 5]
b = [9, 8, 7, 6, 5]

lista =set(a)
listb =set(b)   
print listb.intersection(lista)   
returnMatches = set(['5']) #output 

print " ".join(str(return) for return in returnMatches ) # remove the set()   

 5        #final output 

回答 11

也可以使用itertools.product。

>>> common_elements=[]
>>> for i in list(itertools.product(a,b)):
...     if i[0] == i[1]:
...         common_elements.append(i[0])

Can use itertools.product too.

>>> common_elements=[]
>>> for i in list(itertools.product(a,b)):
...     if i[0] == i[1]:
...         common_elements.append(i[0])

回答 12

您可以使用

def returnMatches(a,b):
       return list(set(a) & set(b))

You can use

def returnMatches(a,b):
       return list(set(a) & set(b))

回答 13

您可以使用:

a = [1, 3, 4, 5, 9, 6, 7, 8]
b = [1, 7, 0, 9]
same_values = set(a) & set(b)
print same_values

输出:

set([1, 7, 9])

You can use:

a = [1, 3, 4, 5, 9, 6, 7, 8]
b = [1, 7, 0, 9]
same_values = set(a) & set(b)
print same_values

Output:

set([1, 7, 9])

回答 14

如果要布尔值:

>>> a = [1, 2, 3, 4, 5]
>>> b = [9, 8, 7, 6, 5]
>>> set(b) == set(a)  & set(b) and set(a) == set(a) & set(b)
False
>>> a = [3,1,2]
>>> b = [1,2,3]
>>> set(b) == set(a)  & set(b) and set(a) == set(a) & set(b)
True

If you want a boolean value:

>>> a = [1, 2, 3, 4, 5]
>>> b = [9, 8, 7, 6, 5]
>>> set(b) == set(a)  & set(b) and set(a) == set(a) & set(b)
False
>>> a = [3,1,2]
>>> b = [1,2,3]
>>> set(b) == set(a)  & set(b) and set(a) == set(a) & set(b)
True

回答 15

以下解决方案适用于任何顺序的列表项,并且还支持两个列表的长度不同。

import numpy as np
def getMatches(a, b):
    matches = []
    unique_a = np.unique(a)
    unique_b = np.unique(b)
    for a in unique_a:
        for b in unique_b:
            if a == b:
                matches.append(a)
    return matches
print(getMatches([1, 2, 3, 4, 5], [9, 8, 7, 6, 5, 9])) # displays [5]
print(getMatches([1, 2, 3], [3, 4, 5, 1])) # displays [1, 3]

The following solution works for any order of list items and also supports both lists to be different length.

import numpy as np
def getMatches(a, b):
    matches = []
    unique_a = np.unique(a)
    unique_b = np.unique(b)
    for a in unique_a:
        for b in unique_b:
            if a == b:
                matches.append(a)
    return matches
print(getMatches([1, 2, 3, 4, 5], [9, 8, 7, 6, 5, 9])) # displays [5]
print(getMatches([1, 2, 3], [3, 4, 5, 1])) # displays [1, 3]

回答 16

使用__and__属性方法也可以。

>>> a = [1, 2, 3, 4, 5]
>>> b = [9, 8, 7, 6, 5]
>>> set(a).__and__(set(b))
set([5])

或简单地

>>> set([1, 2, 3, 4, 5]).__and__(set([9, 8, 7, 6, 5]))
set([5])
>>>    

Using __and__ attribute method also works.

>>> a = [1, 2, 3, 4, 5]
>>> b = [9, 8, 7, 6, 5]
>>> set(a).__and__(set(b))
set([5])

or simply

>>> set([1, 2, 3, 4, 5]).__and__(set([9, 8, 7, 6, 5]))
set([5])
>>>    

回答 17

you can | for set union and & for set intersection.
for example:

    set1={1,2,3}
    set2={3,4,5}
    print(set1&set2)
    output=3

    set1={1,2,3}
    set2={3,4,5}
    print(set1|set2)
    output=1,2,3,4,5

curly braces in the answer.
you can | for set union and & for set intersection.
for example:

    set1={1,2,3}
    set2={3,4,5}
    print(set1&set2)
    output=3

    set1={1,2,3}
    set2={3,4,5}
    print(set1|set2)
    output=1,2,3,4,5

curly braces in the answer.

回答 18

我只使用了以下内容,它对我有用:

group1 = [1, 2, 3, 4, 5]
group2 = [9, 8, 7, 6, 5]

for k in group1:
    for v in group2:
        if k == v:
            print(k)

这将在您的情况下打印5。可能不是明智的表现。

I just used the following and it worked for me:

group1 = [1, 2, 3, 4, 5]
group2 = [9, 8, 7, 6, 5]

for k in group1:
    for v in group2:
        if k == v:
            print(k)

this would then print 5 in your case. Probably not great performance wise though.


什么是备忘录,如何在Python中使用备忘录?

问题:什么是备忘录,如何在Python中使用备忘录?

我刚开始使用Python,却不知道什么是记忆以及如何使用它。另外,我可以举一个简化的例子吗?

I just started Python and I’ve got no idea what memoization is and how to use it. Also, may I have a simplified example?


回答 0

记忆有效地指基于方法输入记忆方法调用的结果(“记忆”→“备忘录”→要记忆),然后返回记忆的结果,而不是再次计算结果。您可以将其视为方法结果的缓存。有关更多详细信息,请参见第387页,Cormen等人的算法简介(3e)中的定义。

一个简单的示例,使用Python中的记忆来计算阶乘是这样的:

factorial_memo = {}
def factorial(k):
    if k < 2: return 1
    if k not in factorial_memo:
        factorial_memo[k] = k * factorial(k-1)
    return factorial_memo[k]

您可能会变得更加复杂,并将备注过程封装到一个类中:

class Memoize:
    def __init__(self, f):
        self.f = f
        self.memo = {}
    def __call__(self, *args):
        if not args in self.memo:
            self.memo[args] = self.f(*args)
        #Warning: You may wish to do a deepcopy here if returning objects
        return self.memo[args]

然后:

def factorial(k):
    if k < 2: return 1
    return k * factorial(k - 1)

factorial = Memoize(factorial)

Python 2.4中添加了一个称为“ 装饰器 ”的功能,使您现在只需编写以下代码即可完成同一操作:

@Memoize
def factorial(k):
    if k < 2: return 1
    return k * factorial(k - 1)

Python的装饰图书馆有一个名为类似的装饰memoized是不是稍微更强大的是Memoize这里显示类。

Memoization effectively refers to remembering (“memoization” → “memorandum” → to be remembered) results of method calls based on the method inputs and then returning the remembered result rather than computing the result again. You can think of it as a cache for method results. For further details, see page 387 for the definition in Introduction To Algorithms (3e), Cormen et al.

A simple example for computing factorials using memoization in Python would be something like this:

factorial_memo = {}
def factorial(k):
    if k < 2: return 1
    if k not in factorial_memo:
        factorial_memo[k] = k * factorial(k-1)
    return factorial_memo[k]

You can get more complicated and encapsulate the memoization process into a class:

class Memoize:
    def __init__(self, f):
        self.f = f
        self.memo = {}
    def __call__(self, *args):
        if not args in self.memo:
            self.memo[args] = self.f(*args)
        #Warning: You may wish to do a deepcopy here if returning objects
        return self.memo[args]

Then:

def factorial(k):
    if k < 2: return 1
    return k * factorial(k - 1)

factorial = Memoize(factorial)

A feature known as “decorators” was added in Python 2.4 which allow you to now simply write the following to accomplish the same thing:

@Memoize
def factorial(k):
    if k < 2: return 1
    return k * factorial(k - 1)

The Python Decorator Library has a similar decorator called memoized that is slightly more robust than the Memoize class shown here.


回答 1

是Python 3.2的新功能functools.lru_cache。默认情况下,它仅缓存最近使用的128个调用,但是您可以将设置为maxsizeNone以指示缓存永不过期:

import functools

@functools.lru_cache(maxsize=None)
def fib(num):
    if num < 2:
        return num
    else:
        return fib(num-1) + fib(num-2)

此功能本身非常慢,请尝试fib(36),您将需要等待大约十秒钟。

添加lru_cache注释可确保如果最近已为特定值调用了该函数,则该函数不会重新计算该值,而是使用缓存的先前结果。在这种情况下,它可以极大地提高速度,而代码不会因缓存的细节而混乱。

New to Python 3.2 is functools.lru_cache. By default, it only caches the 128 most recently used calls, but you can set the maxsize to None to indicate that the cache should never expire:

import functools

@functools.lru_cache(maxsize=None)
def fib(num):
    if num < 2:
        return num
    else:
        return fib(num-1) + fib(num-2)

This function by itself is very slow, try fib(36) and you will have to wait about ten seconds.

Adding lru_cache annotation ensures that if the function has been called recently for a particular value, it will not recompute that value, but use a cached previous result. In this case, it leads to a tremendous speed improvement, while the code is not cluttered with the details of caching.


回答 2

其他答案涵盖了它相当不错的地方。我不再重复。只是一些对您可能有用的观点。

通常,备注是一项操作,您可以将其应用于可计算某些内容(昂贵)并返回值的任何函数。因此,通常将其实现为装饰器。实现很简单,就像这样

memoised_function = memoise(actual_function)

或表示为装饰者

@memoise
def actual_function(arg1, arg2):
   #body

The other answers cover what it is quite well. I’m not repeating that. Just some points that might be useful to you.

Usually, memoisation is an operation you can apply on any function that computes something (expensive) and returns a value. Because of this, it’s often implemented as a decorator. The implementation is straightforward and it would be something like this

memoised_function = memoise(actual_function)

or expressed as a decorator

@memoise
def actual_function(arg1, arg2):
   #body

回答 3

备注保持了昂贵的计算结果并返回缓存的结果,而不是不断地重新计算它。

这是一个例子:

def doSomeExpensiveCalculation(self, input):
    if input not in self.cache:
        <do expensive calculation>
        self.cache[input] = result
    return self.cache[input]

有关备注的详细信息,请参见Wikipedia条目

Memoization is keeping the results of expensive calculations and returning the cached result rather than continuously recalculating it.

Here’s an example:

def doSomeExpensiveCalculation(self, input):
    if input not in self.cache:
        <do expensive calculation>
        self.cache[input] = result
    return self.cache[input]

A more complete description can be found in the wikipedia entry on memoization.


回答 4

hasattr对于想手工制作的人,不要忘了内置功能。这样,您可以将mem缓存保留在函数定义内(而不是全局)。

def fact(n):
    if not hasattr(fact, 'mem'):
        fact.mem = {1: 1}
    if not n in fact.mem:
        fact.mem[n] = n * fact(n - 1)
    return fact.mem[n]

Let’s not forget the built-in hasattr function, for those who want to hand-craft. That way you can keep the mem cache inside the function definition (as opposed to a global).

def fact(n):
    if not hasattr(fact, 'mem'):
        fact.mem = {1: 1}
    if not n in fact.mem:
        fact.mem[n] = n * fact(n - 1)
    return fact.mem[n]

回答 5

我发现这非常有用

def memoize(function):
    from functools import wraps

    memo = {}

    @wraps(function)
    def wrapper(*args):
        if args in memo:
            return memo[args]
        else:
            rv = function(*args)
            memo[args] = rv
            return rv
    return wrapper


@memoize
def fibonacci(n):
    if n < 2: return n
    return fibonacci(n - 1) + fibonacci(n - 2)

fibonacci(25)

I’ve found this extremely useful

def memoize(function):
    from functools import wraps

    memo = {}

    @wraps(function)
    def wrapper(*args):
        if args in memo:
            return memo[args]
        else:
            rv = function(*args)
            memo[args] = rv
            return rv
    return wrapper


@memoize
def fibonacci(n):
    if n < 2: return n
    return fibonacci(n - 1) + fibonacci(n - 2)

fibonacci(25)

回答 6

记忆基本上是保存过去使用递归算法完成的操作的结果,以便减少在以后需要相同计算时遍历递归树的需求。

参见http://scriptbucket.wordpress.com/2012/12/11/introduction-to-memoization/

Python中的斐波那契记忆示例:

fibcache = {}
def fib(num):
    if num in fibcache:
        return fibcache[num]
    else:
        fibcache[num] = num if num < 2 else fib(num-1) + fib(num-2)
        return fibcache[num]

Memoization is basically saving the results of past operations done with recursive algorithms in order to reduce the need to traverse the recursion tree if the same calculation is required at a later stage.

see http://scriptbucket.wordpress.com/2012/12/11/introduction-to-memoization/

Fibonacci Memoization example in Python:

fibcache = {}
def fib(num):
    if num in fibcache:
        return fibcache[num]
    else:
        fibcache[num] = num if num < 2 else fib(num-1) + fib(num-2)
        return fibcache[num]

回答 7

记忆化是将功能转换为数据结构。通常,人们希望转换是渐进地和惰性地进行(根据给定的域元素或“键”的需求)。在惰性函数语言中,这种惰性转换可以自动发生,因此可以实现备忘录而没有(显式)副作用。

Memoization is the conversion of functions into data structures. Usually one wants the conversion to occur incrementally and lazily (on demand of a given domain element–or “key”). In lazy functional languages, this lazy conversion can happen automatically, and thus memoization can be implemented without (explicit) side-effects.


回答 8

好吧,我应该首先回答第一部分:什么是记忆?

这只是以时间交换内存的一种方法。想想乘法表

在Python中使用可变对象作为默认值通常被认为是不好的。但是,如果明智地使用它,则实际上可以实现memoization

这是一个改编自http://docs.python.org/2/faq/design.html#why-are-default-values-shared-between-objects的示例

使用dict函数定义中的可变变量,可以缓存中间计算结果(例如,在计算factorial(10)后进行计算时factorial(9),我们可以重用所有中间结果)

def factorial(n, _cache={1:1}):    
    try:            
        return _cache[n]           
    except IndexError:
        _cache[n] = factorial(n-1)*n
        return _cache[n]

Well I should answer the first part first: what’s memoization?

It’s just a method to trade memory for time. Think of Multiplication Table.

Using mutable object as default value in Python is usually considered bad. But if use it wisely, it can actually be useful to implement a memoization.

Here’s an example adapted from http://docs.python.org/2/faq/design.html#why-are-default-values-shared-between-objects

Using a mutable dict in the function definition, the intermediate computed results can be cached (e.g. when calculating factorial(10) after calculate factorial(9), we can reuse all the intermediate results)

def factorial(n, _cache={1:1}):    
    try:            
        return _cache[n]           
    except IndexError:
        _cache[n] = factorial(n-1)*n
        return _cache[n]

回答 9

这是一个可以使用列表或字典类型参数而无需抱怨的解决方案:

def memoize(fn):
    """returns a memoized version of any function that can be called
    with the same list of arguments.
    Usage: foo = memoize(foo)"""

    def handle_item(x):
        if isinstance(x, dict):
            return make_tuple(sorted(x.items()))
        elif hasattr(x, '__iter__'):
            return make_tuple(x)
        else:
            return x

    def make_tuple(L):
        return tuple(handle_item(x) for x in L)

    def foo(*args, **kwargs):
        items_cache = make_tuple(sorted(kwargs.items()))
        args_cache = make_tuple(args)
        if (args_cache, items_cache) not in foo.past_calls:
            foo.past_calls[(args_cache, items_cache)] = fn(*args,**kwargs)
        return foo.past_calls[(args_cache, items_cache)]
    foo.past_calls = {}
    foo.__name__ = 'memoized_' + fn.__name__
    return foo

请注意,通过在handle_item中作为特殊情况实现自己的哈希函数,可以自然地将此方法扩展到任何对象。例如,要使该方法适用于将集合作为输入参数的函数,可以将其添加到handle_item中:

if is_instance(x, set):
    return make_tuple(sorted(list(x)))

Here is a solution that will work with list or dict type arguments without whining:

def memoize(fn):
    """returns a memoized version of any function that can be called
    with the same list of arguments.
    Usage: foo = memoize(foo)"""

    def handle_item(x):
        if isinstance(x, dict):
            return make_tuple(sorted(x.items()))
        elif hasattr(x, '__iter__'):
            return make_tuple(x)
        else:
            return x

    def make_tuple(L):
        return tuple(handle_item(x) for x in L)

    def foo(*args, **kwargs):
        items_cache = make_tuple(sorted(kwargs.items()))
        args_cache = make_tuple(args)
        if (args_cache, items_cache) not in foo.past_calls:
            foo.past_calls[(args_cache, items_cache)] = fn(*args,**kwargs)
        return foo.past_calls[(args_cache, items_cache)]
    foo.past_calls = {}
    foo.__name__ = 'memoized_' + fn.__name__
    return foo

Note that this approach can be naturally extended to any object by implementing your own hash function as a special case in handle_item. For example, to make this approach work for a function that takes a set as an input argument, you could add to handle_item:

if is_instance(x, set):
    return make_tuple(sorted(list(x)))

回答 10

与位置参数和关键字参数一起使用的解决方案,与关键字args的传递顺序无关(使用inspect.getargspec):

import inspect
import functools

def memoize(fn):
    cache = fn.cache = {}
    @functools.wraps(fn)
    def memoizer(*args, **kwargs):
        kwargs.update(dict(zip(inspect.getargspec(fn).args, args)))
        key = tuple(kwargs.get(k, None) for k in inspect.getargspec(fn).args)
        if key not in cache:
            cache[key] = fn(**kwargs)
        return cache[key]
    return memoizer

相似的问题:确定等效的varargs函数调用以在Python中进行记忆

Solution that works with both positional and keyword arguments independently of order in which keyword args were passed (using inspect.getargspec):

import inspect
import functools

def memoize(fn):
    cache = fn.cache = {}
    @functools.wraps(fn)
    def memoizer(*args, **kwargs):
        kwargs.update(dict(zip(inspect.getargspec(fn).args, args)))
        key = tuple(kwargs.get(k, None) for k in inspect.getargspec(fn).args)
        if key not in cache:
            cache[key] = fn(**kwargs)
        return cache[key]
    return memoizer

Similar question: Identifying equivalent varargs function calls for memoization in Python


回答 11

cache = {}
def fib(n):
    if n <= 1:
        return n
    else:
        if n not in cache:
            cache[n] = fib(n-1) + fib(n-2)
        return cache[n]
cache = {}
def fib(n):
    if n <= 1:
        return n
    else:
        if n not in cache:
            cache[n] = fib(n-1) + fib(n-2)
        return cache[n]

回答 12

只是想添加到已经提供的答案中,Python装饰器库具有一些简单但有用的实现,与相比,它们还可以记住“无法散列的类型” functools.lru_cache

Just wanted to add to the answers already provided, the Python decorator library has some simple yet useful implementations that can also memoize “unhashable types”, unlike functools.lru_cache.


检查另一个字符串中是否存在多个字符串

问题:检查另一个字符串中是否存在多个字符串

如何检查数组中的任何字符串是否在另一个字符串中?

喜欢:

a = ['a', 'b', 'c']
str = "a123"
if a in str:
  print "some of the strings found in str"
else:
  print "no strings found in str"

该代码行不通,只是为了展示我想要实现的目标。

How can I check if any of the strings in an array exists in another string?

Like:

a = ['a', 'b', 'c']
str = "a123"
if a in str:
  print "some of the strings found in str"
else:
  print "no strings found in str"

That code doesn’t work, it’s just to show what I want to achieve.


回答 0

您可以使用any

a_string = "A string is more than its parts!"
matches = ["more", "wholesome", "milk"]

if any(x in a_string for x in matches):

同样检查 找到了列表中的所有字符串,请使用all代替any

You can use any:

a_string = "A string is more than its parts!"
matches = ["more", "wholesome", "milk"]

if any(x in a_string for x in matches):

Similarly to check if all the strings from the list are found, use all instead of any.


回答 1

any()到目前为止,如果您想要的只是TrueFalse,那么这是最好的方法,但是如果您想具体了解哪些字符串匹配,则可以使用一些方法。

如果要进行第一个匹配(False默认为):

match = next((x for x in a if x in str), False)

如果要获得所有匹配项(包括重复项):

matches = [x for x in a if x in str]

如果要获取所有非重复的匹配项(不考虑顺序):

matches = {x for x in a if x in str}

如果要以正确的顺序获取所有非重复的匹配项:

matches = []
for x in a:
    if x in str and x not in matches:
        matches.append(x)

any() is by far the best approach if all you want is True or False, but if you want to know specifically which string/strings match, you can use a couple things.

If you want the first match (with False as a default):

match = next((x for x in a if x in str), False)

If you want to get all matches (including duplicates):

matches = [x for x in a if x in str]

If you want to get all non-duplicate matches (disregarding order):

matches = {x for x in a if x in str}

If you want to get all non-duplicate matches in the right order:

matches = []
for x in a:
    if x in str and x not in matches:
        matches.append(x)

回答 2

如果输入的字符串变长astr变长,则应小心。简单的解采用O(S *(A ^ 2)),其中S是的长度,str而A是中的所有字符串的长度之和a。为获得更快的解决方案,请查看用于字符串匹配的Aho-Corasick算法,该算法以线性时间O(S + A)运行。

You should be careful if the strings in a or str gets longer. The straightforward solutions take O(S*(A^2)), where S is the length of str and A is the sum of the lenghts of all strings in a. For a faster solution, look at Aho-Corasick algorithm for string matching, which runs in linear time O(S+A).


回答 3

只是为了增加一些多样性regex

import re

if any(re.findall(r'a|b|c', str, re.IGNORECASE)):
    print 'possible matches thanks to regex'
else:
    print 'no matches'

或者如果您的清单太长- any(re.findall(r'|'.join(a), str, re.IGNORECASE))

Just to add some diversity with regex:

import re

if any(re.findall(r'a|b|c', str, re.IGNORECASE)):
    print 'possible matches thanks to regex'
else:
    print 'no matches'

or if your list is too long – any(re.findall(r'|'.join(a), str, re.IGNORECASE))


回答 4

您需要迭代a的元素。

a = ['a', 'b', 'c']
str = "a123"
found_a_string = False
for item in a:    
    if item in str:
        found_a_string = True

if found_a_string:
    print "found a match"
else:
    print "no match found"

You need to iterate on the elements of a.

a = ['a', 'b', 'c']
str = "a123"
found_a_string = False
for item in a:    
    if item in str:
        found_a_string = True

if found_a_string:
    print "found a match"
else:
    print "no match found"

回答 5

jbernadas已经提到Aho-Corasick-Algorithm,以降低复杂性。

这是在Python中使用它的一种方法:

  1. 这里下载aho_corasick.py

  2. 将其与主Python文件放在同一目录中并命名 aho_corasick.py

  3. 使用以下代码尝试算法:

    from aho_corasick import aho_corasick #(string, keywords)
    
    print(aho_corasick(string, ["keyword1", "keyword2"]))

请注意,搜索区分大小写

jbernadas already mentioned the Aho-Corasick-Algorithm in order to reduce complexity.

Here is one way to use it in Python:

  1. Download aho_corasick.py from here

  2. Put it in the same directory as your main Python file and name it aho_corasick.py

  3. Try the alrorithm with the following code:

    from aho_corasick import aho_corasick #(string, keywords)
    
    print(aho_corasick(string, ["keyword1", "keyword2"]))
    

Note that the search is case-sensitive


回答 6

a = ['a', 'b', 'c']
str =  "a123"

a_match = [True for match in a if match in str]

if True in a_match:
  print "some of the strings found in str"
else:
  print "no strings found in str"
a = ['a', 'b', 'c']
str =  "a123"

a_match = [True for match in a if match in str]

if True in a_match:
  print "some of the strings found in str"
else:
  print "no strings found in str"

回答 7

这取决于上下文猜,如果你要检查,如单文字(任何一个字,E,W,..等)足够

original_word ="hackerearcth"
for 'h' in original_word:
      print("YES")

如果要检查original_word中的任何字符:请使用

if any(your_required in yourinput for your_required in original_word ):

如果要在那个original_word中输入所有想要的输入,请使用所有简单的输入

original_word = ['h', 'a', 'c', 'k', 'e', 'r', 'e', 'a', 'r', 't', 'h']
yourinput = str(input()).lower()
if all(requested_word in yourinput for requested_word in original_word):
    print("yes")

It depends on the context suppose if you want to check single literal like(any single word a,e,w,..etc) in is enough

original_word ="hackerearcth"
for 'h' in original_word:
      print("YES")

if you want to check any of the character among the original_word: make use of

if any(your_required in yourinput for your_required in original_word ):

if you want all the input you want in that original_word,make use of all simple

original_word = ['h', 'a', 'c', 'k', 'e', 'r', 'e', 'a', 'r', 't', 'h']
yourinput = str(input()).lower()
if all(requested_word in yourinput for requested_word in original_word):
    print("yes")

回答 8

关于如何获取String中所有列表元素的更多信息

a = ['a', 'b', 'c']
str = "a123" 
list(filter(lambda x:  x in str, a))

Just some more info on how to get all list elements availlable in String

a = ['a', 'b', 'c']
str = "a123" 
list(filter(lambda x:  x in str, a))

回答 9

一种出奇的快速方法是使用set

a = ['a', 'b', 'c']
str = "a123"
if set(a) & set(str):
    print("some of the strings found in str")
else:
    print("no strings found in str")

如果a不包含任何多个字符的值(在这种情况下,请使用上面any列出的值),则此方法有效。如果是这样,这是简单的指定为字符串:。aa = 'abc'

A surprisingly fast approach is to use set:

a = ['a', 'b', 'c']
str = "a123"
if set(a) & set(str):
    print("some of the strings found in str")
else:
    print("no strings found in str")

This works if a does not contain any multiple-character values (in which case use any as listed above). If so, it’s simpler to specify a as a string: a = 'abc'.


回答 10

flog = open('test.txt', 'r')
flogLines = flog.readlines()
strlist = ['SUCCESS', 'Done','SUCCESSFUL']
res = False
for line in flogLines:
     for fstr in strlist:
         if line.find(fstr) != -1:
            print('found') 
            res = True


if res:
    print('res true')
else: 
    print('res false')

输出示例图像

flog = open('test.txt', 'r')
flogLines = flog.readlines()
strlist = ['SUCCESS', 'Done','SUCCESSFUL']
res = False
for line in flogLines:
     for fstr in strlist:
         if line.find(fstr) != -1:
            print('found') 
            res = True


if res:
    print('res true')
else: 
    print('res false')

output example image


回答 11

我会使用这种功能来提高速度:

def check_string(string, substring_list):
    for substring in substring_list:
        if substring in string:
            return True
    return False

I would use this kind of function for speed:

def check_string(string, substring_list):
    for substring in substring_list:
        if substring in string:
            return True
    return False

回答 12

data = "firstName and favoriteFood"
mandatory_fields = ['firstName', 'lastName', 'age']


# for each
for field in mandatory_fields:
    if field not in data:
        print("Error, missing req field {0}".format(field));

# still fine, multiple if statements
if ('firstName' not in data or 
    'lastName' not in data or
    'age' not in data):
    print("Error, missing a req field");

# not very readable, list comprehension
missing_fields = [x for x in mandatory_fields if x not in data]
if (len(missing_fields)>0):
    print("Error, missing fields {0}".format(", ".join(missing_fields)));
data = "firstName and favoriteFood"
mandatory_fields = ['firstName', 'lastName', 'age']


# for each
for field in mandatory_fields:
    if field not in data:
        print("Error, missing req field {0}".format(field));

# still fine, multiple if statements
if ('firstName' not in data or 
    'lastName' not in data or
    'age' not in data):
    print("Error, missing a req field");

# not very readable, list comprehension
missing_fields = [x for x in mandatory_fields if x not in data]
if (len(missing_fields)>0):
    print("Error, missing fields {0}".format(", ".join(missing_fields)));

从列表中删除所有出现的值?

问题:从列表中删除所有出现的值?

在Python中,remove()将删除列表中第一个出现的值。

如何从列表中删除所有出现的值?

这就是我的想法:

>>> remove_values_from_list([1, 2, 3, 4, 2, 2, 3], 2)
[1, 3, 4, 3]

In Python remove() will remove the first occurrence of value in a list.

How to remove all occurrences of a value from a list?

This is what I have in mind:

>>> remove_values_from_list([1, 2, 3, 4, 2, 2, 3], 2)
[1, 3, 4, 3]

回答 0

功能方法:

Python 3.x

>>> x = [1,2,3,2,2,2,3,4]
>>> list(filter((2).__ne__, x))
[1, 3, 3, 4]

要么

>>> x = [1,2,3,2,2,2,3,4]
>>> list(filter(lambda a: a != 2, x))
[1, 3, 3, 4]

Python 2.x

>>> x = [1,2,3,2,2,2,3,4]
>>> filter(lambda a: a != 2, x)
[1, 3, 3, 4]

Functional approach:

Python 3.x

>>> x = [1,2,3,2,2,2,3,4]
>>> list(filter((2).__ne__, x))
[1, 3, 3, 4]

or

>>> x = [1,2,3,2,2,2,3,4]
>>> list(filter(lambda a: a != 2, x))
[1, 3, 3, 4]

Python 2.x

>>> x = [1,2,3,2,2,2,3,4]
>>> filter(lambda a: a != 2, x)
[1, 3, 3, 4]

回答 1

您可以使用列表理解:

def remove_values_from_list(the_list, val):
   return [value for value in the_list if value != val]

x = [1, 2, 3, 4, 2, 2, 3]
x = remove_values_from_list(x, 2)
print x
# [1, 3, 4, 3]

You can use a list comprehension:

def remove_values_from_list(the_list, val):
   return [value for value in the_list if value != val]

x = [1, 2, 3, 4, 2, 2, 3]
x = remove_values_from_list(x, 2)
print x
# [1, 3, 4, 3]

回答 2

如果必须修改原始列表,则可以使用切片分配,同时仍使用有效的列表理解(或生成器表达式)。

>>> x = [1, 2, 3, 4, 2, 2, 3]
>>> x[:] = (value for value in x if value != 2)
>>> x
[1, 3, 4, 3]

You can use slice assignment if the original list must be modified, while still using an efficient list comprehension (or generator expression).

>>> x = [1, 2, 3, 4, 2, 2, 3]
>>> x[:] = (value for value in x if value != 2)
>>> x
[1, 3, 4, 3]

回答 3

以更抽象的方式重复第一篇文章的解决方案:

>>> x = [1, 2, 3, 4, 2, 2, 3]
>>> while 2 in x: x.remove(2)
>>> x
[1, 3, 4, 3]

Repeating the solution of the first post in a more abstract way:

>>> x = [1, 2, 3, 4, 2, 2, 3]
>>> while 2 in x: x.remove(2)
>>> x
[1, 3, 4, 3]

回答 4

查看简单的解决方案

>>> [i for i in x if i != 2]

这将返回具有所有元素的列表x,而不2

See the simple solution

>>> [i for i in x if i != 2]

This will return a list having all elements of x without 2


回答 5

上面的所有答案(除了Martin Andersson的答案)都会创建一个没有所需项目的新列表,而不是从原始列表中删除这些项目。

>>> import random, timeit
>>> a = list(range(5)) * 1000
>>> random.shuffle(a)

>>> b = a
>>> print(b is a)
True

>>> b = [x for x in b if x != 0]
>>> print(b is a)
False
>>> b.count(0)
0
>>> a.count(0)
1000

>>> b = a
>>> b = filter(lambda a: a != 2, x)
>>> print(b is a)
False

如果您对列表有其他参考,这可能很重要。

要修改列表,请使用如下方法

>>> def removeall_inplace(x, l):
...     for _ in xrange(l.count(x)):
...         l.remove(x)
...
>>> removeall_inplace(0, b)
>>> b is a
True
>>> a.count(0)
0

就速度而言,我的笔记本电脑上的结果是(全部在5000个条目列表中,其中删除了1000个条目)

  • 列表理解-〜400us
  • 过滤器-〜900us
  • .remove()循环-50毫秒

因此.remove循环要慢100倍左右……..嗯,也许需要另一种方法。我发现最快的方法是使用列表理解,然后替换原始列表的内容。

>>> def removeall_replace(x, l):
....    t = [y for y in l if y != x]
....    del l[:]
....    l.extend(t)
  • removeall_replace()-450us

All of the answers above (apart from Martin Andersson’s) create a new list without the desired items, rather than removing the items from the original list.

>>> import random, timeit
>>> a = list(range(5)) * 1000
>>> random.shuffle(a)

>>> b = a
>>> print(b is a)
True

>>> b = [x for x in b if x != 0]
>>> print(b is a)
False
>>> b.count(0)
0
>>> a.count(0)
1000

>>> b = a
>>> b = filter(lambda a: a != 2, x)
>>> print(b is a)
False

This can be important if you have other references to the list hanging around.

To modify the list in place, use a method like this

>>> def removeall_inplace(x, l):
...     for _ in xrange(l.count(x)):
...         l.remove(x)
...
>>> removeall_inplace(0, b)
>>> b is a
True
>>> a.count(0)
0

As far as speed is concerned, results on my laptop are (all on a 5000 entry list with 1000 entries removed)

  • List comprehension – ~400us
  • Filter – ~900us
  • .remove() loop – 50ms

So the .remove loop is about 100x slower…….. Hmmm, maybe a different approach is needed. The fastest I’ve found is using the list comprehension, but then replace the contents of the original list.

>>> def removeall_replace(x, l):
....    t = [y for y in l if y != x]
....    del l[:]
....    l.extend(t)
  • removeall_replace() – 450us

回答 6

你可以这样做

while 2 in x:   
    x.remove(2)

you can do this

while 2 in x:   
    x.remove(2)

回答 7

我以牺牲可读性为代价,认为此版本稍快一些,因为它不会迫使用户重新检查列表,因此无论如何,要做的删除操作完全相同:

x = [1, 2, 3, 4, 2, 2, 3]
def remove_values_from_list(the_list, val):
    for i in range(the_list.count(val)):
        the_list.remove(val)

remove_values_from_list(x, 2)

print(x)

At the cost of readability, I think this version is slightly faster as it doesn’t force the while to reexamine the list, thus doing exactly the same work remove has to do anyway:

x = [1, 2, 3, 4, 2, 2, 3]
def remove_values_from_list(the_list, val):
    for i in range(the_list.count(val)):
        the_list.remove(val)

remove_values_from_list(x, 2)

print(x)

回答 8

针对具有1.000.000元素的列表/数组的numpy方法和计时:

时间:

In [10]: a.shape
Out[10]: (1000000,)

In [13]: len(lst)
Out[13]: 1000000

In [18]: %timeit a[a != 2]
100 loops, best of 3: 2.94 ms per loop

In [19]: %timeit [x for x in lst if x != 2]
10 loops, best of 3: 79.7 ms per loop

结论: numpy比列表理解方法快27倍(在我的笔记本上)

PS,如果您想将常规Python列表转换lst为numpy数组:

arr = np.array(lst)

设定:

import numpy as np
a = np.random.randint(0, 1000, 10**6)

In [10]: a.shape
Out[10]: (1000000,)

In [12]: lst = a.tolist()

In [13]: len(lst)
Out[13]: 1000000

校验:

In [14]: a[a != 2].shape
Out[14]: (998949,)

In [15]: len([x for x in lst if x != 2])
Out[15]: 998949

Numpy approach and timings against a list/array with 1.000.000 elements:

Timings:

In [10]: a.shape
Out[10]: (1000000,)

In [13]: len(lst)
Out[13]: 1000000

In [18]: %timeit a[a != 2]
100 loops, best of 3: 2.94 ms per loop

In [19]: %timeit [x for x in lst if x != 2]
10 loops, best of 3: 79.7 ms per loop

Conclusion: numpy is 27 times faster (on my notebook) compared to list comprehension approach

PS if you want to convert your regular Python list lst to numpy array:

arr = np.array(lst)

Setup:

import numpy as np
a = np.random.randint(0, 1000, 10**6)

In [10]: a.shape
Out[10]: (1000000,)

In [12]: lst = a.tolist()

In [13]: len(lst)
Out[13]: 1000000

Check:

In [14]: a[a != 2].shape
Out[14]: (998949,)

In [15]: len([x for x in lst if x != 2])
Out[15]: 998949

回答 9

a = [1, 2, 2, 3, 1]
to_remove = 1
a = [i for i in a if i != to_remove]
print(a)

也许不是最pythonic,但对我来说仍然是最简单的哈哈

a = [1, 2, 2, 3, 1]
to_remove = 1
a = [i for i in a if i != to_remove]
print(a)

Perhaps not the most pythonic but still the easiest for me haha


回答 10

要删除所有重复的事件并将其留在列表中:

test = [1, 1, 2, 3]

newlist = list(set(test))

print newlist

[1, 2, 3]

这是我用于欧拉计画的功能:

def removeOccurrences(e):
  return list(set(e))

To remove all duplicate occurrences and leave one in the list:

test = [1, 1, 2, 3]

newlist = list(set(test))

print newlist

[1, 2, 3]

Here is the function I’ve used for Project Euler:

def removeOccurrences(e):
  return list(set(e))

回答 11

我相信,如果您不关心列表顺序,如果您确实关心最终顺序,则可以从原始位置存储索引并以此来求助,这可能比任何其他方式都快。

category_ids.sort()
ones_last_index = category_ids.count('1')
del category_ids[0:ones_last_index]

I believe this is probably faster than any other way if you don’t care about the lists order, if you do take care about the final order store the indexes from the original and resort by that.

category_ids.sort()
ones_last_index = category_ids.count('1')
del category_ids[0:ones_last_index]

回答 12

for i in range(a.count(' ')):
    a.remove(' ')

我相信要简单得多。

for i in range(a.count(' ')):
    a.remove(' ')

Much simpler I believe.


回答 13

>>> x = [1, 2, 3, 4, 2, 2, 3]

之前已经发布的最简单有效的解决方案是

>>> x[:] = [v for v in x if v != 2]
>>> x
[1, 3, 4, 3]

应该使用较少的内存但速度较慢的另一种可能性是

>>> for i in range(len(x) - 1, -1, -1):
        if x[i] == 2:
            x.pop(i)  # takes time ~ len(x) - i
>>> x
[1, 3, 4, 3]

长度为1000和100000且具有10%匹配项的列表的计时结果:0.16 vs 0.25 ms,23 vs 123 ms。

长度为1000的计时

长度为100000的时间

Let

>>> x = [1, 2, 3, 4, 2, 2, 3]

The simplest and efficient solution as already posted before is

>>> x[:] = [v for v in x if v != 2]
>>> x
[1, 3, 4, 3]

Another possibility which should use less memory but be slower is

>>> for i in range(len(x) - 1, -1, -1):
        if x[i] == 2:
            x.pop(i)  # takes time ~ len(x) - i
>>> x
[1, 3, 4, 3]

Timing results for lists of length 1000 and 100000 with 10% matching entries: 0.16 vs 0.25 ms, and 23 vs 123 ms.

Timing with length 1000

Timing with length 100000


回答 14

从Python列表中删除所有出现的值

lists = [6.9,7,8.9,3,5,4.9,1,2.9,7,9,12.9,10.9,11,7]
def remove_values_from_list():
    for list in lists:
      if(list!=7):
         print(list)
remove_values_from_list()

结果: 6.9 8.9 3 5 4.9 1 2.9 9 12.9 10.9 11

或者,

lists = [6.9,7,8.9,3,5,4.9,1,2.9,7,9,12.9,10.9,11,7]
def remove_values_from_list(remove):
    for list in lists:
      if(list!=remove):
        print(list)
remove_values_from_list(7)

结果: 6.9 8.9 3 5 4.9 1 2.9 9 12.9 10.9 11

Remove all occurrences of a value from a Python list

lists = [6.9,7,8.9,3,5,4.9,1,2.9,7,9,12.9,10.9,11,7]
def remove_values_from_list():
    for list in lists:
      if(list!=7):
         print(list)
remove_values_from_list()

Result: 6.9 8.9 3 5 4.9 1 2.9 9 12.9 10.9 11

Alternatively,

lists = [6.9,7,8.9,3,5,4.9,1,2.9,7,9,12.9,10.9,11,7]
def remove_values_from_list(remove):
    for list in lists:
      if(list!=remove):
        print(list)
remove_values_from_list(7)

Result: 6.9 8.9 3 5 4.9 1 2.9 9 12.9 10.9 11


回答 15

如果您没有内置的filter或不想使用多余的空间,并且需要线性解决方案…

def remove_all(A, v):
    k = 0
    n = len(A)
    for i in range(n):
        if A[i] !=  v:
            A[k] = A[i]
            k += 1

    A = A[:k]

If you didn’t have built-in filter or didn’t want to use extra space and you need a linear solution…

def remove_all(A, v):
    k = 0
    n = len(A)
    for i in range(n):
        if A[i] !=  v:
            A[k] = A[i]
            k += 1

    A = A[:k]

回答 16

hello =  ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
#chech every item for a match
for item in range(len(hello)-1):
     if hello[item] == ' ': 
#if there is a match, rebuild the list with the list before the item + the list after the item
         hello = hello[:item] + hello [item + 1:]
print hello

[‘你好,世界’]

hello =  ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
#chech every item for a match
for item in range(len(hello)-1):
     if hello[item] == ' ': 
#if there is a match, rebuild the list with the list before the item + the list after the item
         hello = hello[:item] + hello [item + 1:]
print hello

[‘h’, ‘e’, ‘l’, ‘l’, ‘o’, ‘w’, ‘o’, ‘r’, ‘l’, ‘d’]


回答 17

我只是这样做了。我只是一个初学者。稍微更高级的程序员肯定可以编写这样的函数。

for i in range(len(spam)):
    spam.remove('cat')
    if 'cat' not in spam:
         print('All instances of ' + 'cat ' + 'have been removed')
         break

I just did this for a list. I am just a beginner. A slightly more advanced programmer can surely write a function like this.

for i in range(len(spam)):
    spam.remove('cat')
    if 'cat' not in spam:
         print('All instances of ' + 'cat ' + 'have been removed')
         break

回答 18

我们也可以使用del或进行就地删除所有内容pop

import random

def remove_values_from_list(lst, target):
    if type(lst) != list:
        return lst

    i = 0
    while i < len(lst):
        if lst[i] == target:
            lst.pop(i)  # length decreased by 1 already
        else:
            i += 1

    return lst

remove_values_from_list(None, 2)
remove_values_from_list([], 2)
remove_values_from_list([1, 2, 3, 4, 2, 2, 3], 2)
lst = remove_values_from_list([random.randrange(0, 10) for x in range(1000000)], 2)
print(len(lst))

现在提高效率:

In [21]: %timeit -n1 -r1 x = random.randrange(0,10)
1 loop, best of 1: 43.5 us per loop

In [22]: %timeit -n1 -r1 lst = [random.randrange(0, 10) for x in range(1000000)]
g1 loop, best of 1: 660 ms per loop

In [23]: %timeit -n1 -r1 lst = remove_values_from_list([random.randrange(0, 10) for x in range(1000000)]
    ...: , random.randrange(0,10))
1 loop, best of 1: 11.5 s per loop

In [27]: %timeit -n1 -r1 x = random.randrange(0,10); lst = [a for a in [random.randrange(0, 10) for x in
    ...:  range(1000000)] if x != a]
1 loop, best of 1: 710 ms per loop

正如我们所见,就地版本remove_values_from_list()不需要任何额外的内存,但是运行起来确实需要更多的时间:

  • 11秒就地删除值
  • 710毫秒的列表解析时间,可在内存中分配新列表

We can also do in-place remove all using either del or pop:

import random

def remove_values_from_list(lst, target):
    if type(lst) != list:
        return lst

    i = 0
    while i < len(lst):
        if lst[i] == target:
            lst.pop(i)  # length decreased by 1 already
        else:
            i += 1

    return lst

remove_values_from_list(None, 2)
remove_values_from_list([], 2)
remove_values_from_list([1, 2, 3, 4, 2, 2, 3], 2)
lst = remove_values_from_list([random.randrange(0, 10) for x in range(1000000)], 2)
print(len(lst))


Now for the efficiency:

In [21]: %timeit -n1 -r1 x = random.randrange(0,10)
1 loop, best of 1: 43.5 us per loop

In [22]: %timeit -n1 -r1 lst = [random.randrange(0, 10) for x in range(1000000)]
g1 loop, best of 1: 660 ms per loop

In [23]: %timeit -n1 -r1 lst = remove_values_from_list([random.randrange(0, 10) for x in range(1000000)]
    ...: , random.randrange(0,10))
1 loop, best of 1: 11.5 s per loop

In [27]: %timeit -n1 -r1 x = random.randrange(0,10); lst = [a for a in [random.randrange(0, 10) for x in
    ...:  range(1000000)] if x != a]
1 loop, best of 1: 710 ms per loop

As we see that in-place version remove_values_from_list() does not require any extra memory, but it does take so much more time to run:

  • 11 seconds for inplace remove values
  • 710 milli seconds for list comprehensions, which allocates a new list in memory

回答 19

没有人发布关于时间和空间复杂性的最佳答案,所以我想我会试一试。这是一个解决方案,它消除了所有出现的特定值,而无需创建新数组,并且时间效率很高。缺点是元素不能保持顺序

时间复杂度:O(n)
附加空间复杂度:O(1)

def main():
    test_case([1, 2, 3, 4, 2, 2, 3], 2)     # [1, 3, 3, 4]
    test_case([3, 3, 3], 3)                 # []
    test_case([1, 1, 1], 3)                 # [1, 1, 1]


def test_case(test_val, remove_val):
    remove_element_in_place(test_val, remove_val)
    print(test_val)


def remove_element_in_place(my_list, remove_value):
    length_my_list = len(my_list)
    swap_idx = length_my_list - 1

    for idx in range(length_my_list - 1, -1, -1):
        if my_list[idx] == remove_value:
            my_list[idx], my_list[swap_idx] = my_list[swap_idx], my_list[idx]
            swap_idx -= 1

    for pop_idx in range(length_my_list - swap_idx - 1):
        my_list.pop() # O(1) operation


if __name__ == '__main__':
    main()

No one has posted an optimal answer for time and space complexity, so I thought I would give it a shot. Here is a solution that removes all occurrences of a specific value without creating a new array and at an efficient time complexity. The drawback is that the elements do not maintain order.

Time complexity: O(n)
Additional space complexity: O(1)

def main():
    test_case([1, 2, 3, 4, 2, 2, 3], 2)     # [1, 3, 3, 4]
    test_case([3, 3, 3], 3)                 # []
    test_case([1, 1, 1], 3)                 # [1, 1, 1]


def test_case(test_val, remove_val):
    remove_element_in_place(test_val, remove_val)
    print(test_val)


def remove_element_in_place(my_list, remove_value):
    length_my_list = len(my_list)
    swap_idx = length_my_list - 1

    for idx in range(length_my_list - 1, -1, -1):
        if my_list[idx] == remove_value:
            my_list[idx], my_list[swap_idx] = my_list[swap_idx], my_list[idx]
            swap_idx -= 1

    for pop_idx in range(length_my_list - swap_idx - 1):
        my_list.pop() # O(1) operation


if __name__ == '__main__':
    main()

回答 20

关于速度!

import time
s_time = time.time()

print 'start'
a = range(100000000)
del a[:]
print 'finished in %0.2f' % (time.time() - s_time)
# start
# finished in 3.25

s_time = time.time()
print 'start'
a = range(100000000)
a = []
print 'finished in %0.2f' % (time.time() - s_time)
# start
# finished in 2.11

About the speed!

import time
s_time = time.time()

print 'start'
a = range(100000000)
del a[:]
print 'finished in %0.2f' % (time.time() - s_time)
# start
# finished in 3.25

s_time = time.time()
print 'start'
a = range(100000000)
a = []
print 'finished in %0.2f' % (time.time() - s_time)
# start
# finished in 2.11

回答 21

p=[2,3,4,4,4]
p.clear()
print(p)
[]

仅适用于Python 3

p=[2,3,4,4,4]
p.clear()
print(p)
[]

Only with Python 3


回答 22

有什么问题:

Motor=['1','2','2']
For i in Motor:
       If i  != '2':
       Print(i)
Print(motor)

使用水蟒

What’s wrong with:

Motor=['1','2','2']
For i in Motor:
       If i  != '2':
       Print(i)
Print(motor)

Using anaconda


如何查找列表中所有出现的元素?

问题:如何查找列表中所有出现的元素?

index()只会给出列表中第一个出现的项目。有没有整齐的技巧可以返回列表中的所有索引?

index() will just give the first occurrence of an item in a list. Is there a neat trick which returns all indices in a list?


回答 0

您可以使用列表理解:

indices = [i for i, x in enumerate(my_list) if x == "whatever"]

You can use a list comprehension:

indices = [i for i, x in enumerate(my_list) if x == "whatever"]

回答 1

虽然不是直接解决列表问题的方法,但numpy对于此类事情确实很有帮助:

import numpy as np
values = np.array([1,2,3,1,2,4,5,6,3,2,1])
searchval = 3
ii = np.where(values == searchval)[0]

返回:

ii ==>array([2, 8])

与其他一些解决方案相比,这对于包含大量元素的列表(数组)而言可能会更快。

While not a solution for lists directly, numpy really shines for this sort of thing:

import numpy as np
values = np.array([1,2,3,1,2,4,5,6,3,2,1])
searchval = 3
ii = np.where(values == searchval)[0]

returns:

ii ==>array([2, 8])

This can be significantly faster for lists (arrays) with a large number of elements vs some of the other solutions.


回答 2

使用list.index以下解决方案:

def indices(lst, element):
    result = []
    offset = -1
    while True:
        try:
            offset = lst.index(element, offset+1)
        except ValueError:
            return result
        result.append(offset)

enumerate对于大型列表,它比使用的列表理解要快得多。如果您已经拥有阵列,它也比numpy解决方案要慢得多,否则转换的成本将超过速度增益(在具有100、1000和10000个元素的整数列表上进行测试)。

注意:基于Chris_Rands的注释的注意事项:如果结果足够稀疏,但是如果列表中有许多正在搜索的元素实例(大于列表的15%),则此解决方案比列表理解要快。 ,在包含1000个整数列表的测试中),列表理解速度更快。

A solution using list.index:

def indices(lst, element):
    result = []
    offset = -1
    while True:
        try:
            offset = lst.index(element, offset+1)
        except ValueError:
            return result
        result.append(offset)

It’s much faster than the list comprehension with enumerate, for large lists. It is also much slower than the numpy solution if you already have the array, otherwise the cost of converting outweighs the speed gain (tested on integer lists with 100, 1000 and 10000 elements).

NOTE: A note of caution based on Chris_Rands’ comment: this solution is faster than the list comprehension if the results are sufficiently sparse, but if the list has many instances of the element that is being searched (more than ~15% of the list, on a test with a list of 1000 integers), the list comprehension is faster.


回答 3

怎么样:

In [1]: l=[1,2,3,4,3,2,5,6,7]

In [2]: [i for i,val in enumerate(l) if val==3]
Out[2]: [2, 4]

How about:

In [1]: l=[1,2,3,4,3,2,5,6,7]

In [2]: [i for i,val in enumerate(l) if val==3]
Out[2]: [2, 4]

回答 4

occurrences = lambda s, lst: (i for i,e in enumerate(lst) if e == s)
list(occurrences(1, [1,2,3,1])) # = [0, 3]
occurrences = lambda s, lst: (i for i,e in enumerate(lst) if e == s)
list(occurrences(1, [1,2,3,1])) # = [0, 3]

回答 5

more_itertools.locate 查找满足条件的所有项目的索引。

from more_itertools import locate


list(locate([0, 1, 1, 0, 1, 0, 0]))
# [1, 2, 4]

list(locate(['a', 'b', 'c', 'b'], lambda x: x == 'b'))
# [1, 3]

more_itertools是第三方库> pip install more_itertools

more_itertools.locate finds indices for all items that satisfy a condition.

from more_itertools import locate


list(locate([0, 1, 1, 0, 1, 0, 0]))
# [1, 2, 4]

list(locate(['a', 'b', 'c', 'b'], lambda x: x == 'b'))
# [1, 3]

more_itertools is a third-party library > pip install more_itertools.


回答 6

针对所有情况的另一种解决方案(对不起,如果重复的话):

values = [1,2,3,1,2,4,5,6,3,2,1]
map(lambda val: (val, [i for i in xrange(len(values)) if values[i] == val]), values)

One more solution(sorry if duplicates) for all occurrences:

values = [1,2,3,1,2,4,5,6,3,2,1]
map(lambda val: (val, [i for i in xrange(len(values)) if values[i] == val]), values)

回答 7

或使用range(python 3):

l=[i for i in range(len(lst)) if lst[i]=='something...']

对于(python 2):

l=[i for i in xrange(len(lst)) if lst[i]=='something...']

然后(两种情况):

print(l)

符合预期。

Or Use range (python 3):

l=[i for i in range(len(lst)) if lst[i]=='something...']

For (python 2):

l=[i for i in xrange(len(lst)) if lst[i]=='something...']

And then (both cases):

print(l)

Is as expected.


回答 8

在python2中使用filter()。

>>> q = ['Yeehaw', 'Yeehaw', 'Googol', 'B9', 'Googol', 'NSM', 'B9', 'NSM', 'Dont Ask', 'Googol']
>>> filter(lambda i: q[i]=="Googol", range(len(q)))
[2, 4, 9]

Using filter() in python2.

>>> q = ['Yeehaw', 'Yeehaw', 'Googol', 'B9', 'Googol', 'NSM', 'B9', 'NSM', 'Dont Ask', 'Googol']
>>> filter(lambda i: q[i]=="Googol", range(len(q)))
[2, 4, 9]

回答 9

您可以创建一个defaultdict

from collections import defaultdict
d1 = defaultdict(int)      # defaults to 0 values for keys
unq = set(lst1)              # lst1 = [1, 2, 2, 3, 4, 1, 2, 7]
for each in unq:
      d1[each] = lst1.count(each)
else:
      print(d1)

You can create a defaultdict

from collections import defaultdict
d1 = defaultdict(int)      # defaults to 0 values for keys
unq = set(lst1)              # lst1 = [1, 2, 2, 3, 4, 1, 2, 7]
for each in unq:
      d1[each] = lst1.count(each)
else:
      print(d1)

回答 10

获取列表中一个或多个(相同)项目的所有出现次数和位置

使用enumerate(alist)可以存储第一个元素(n),即元素x等于要查找的内容时列表的索引。

>>> alist = ['foo', 'spam', 'egg', 'foo']
>>> foo_indexes = [n for n,x in enumerate(alist) if x=='foo']
>>> foo_indexes
[0, 3]
>>>

让我们使函数findindex

该函数将项目和列表作为参数,并返回项目在列表中的位置,就像我们之前看到的那样。

def indexlist(item2find, list_or_string):
  "Returns all indexes of an item in a list or a string"
  return [n for n,item in enumerate(list_or_string) if item==item2find]

print(indexlist("1", "010101010"))

输出量


[1, 3, 5, 7]

简单

for n, i in enumerate([1, 2, 3, 4, 1]):
    if i == 1:
        print(n)

输出:

0
4

Getting all the occurrences and the position of one or more (identical) items in a list

With enumerate(alist) you can store the first element (n) that is the index of the list when the element x is equal to what you look for.

>>> alist = ['foo', 'spam', 'egg', 'foo']
>>> foo_indexes = [n for n,x in enumerate(alist) if x=='foo']
>>> foo_indexes
[0, 3]
>>>

Let’s make our function findindex

This function takes the item and the list as arguments and return the position of the item in the list, like we saw before.

def indexlist(item2find, list_or_string):
  "Returns all indexes of an item in a list or a string"
  return [n for n,item in enumerate(list_or_string) if item==item2find]

print(indexlist("1", "010101010"))

Output


[1, 3, 5, 7]

Simple

for n, i in enumerate([1, 2, 3, 4, 1]):
    if i == 1:
        print(n)

Output:

0
4

回答 11

使用for-loop

  • 带有enumerate列表理解的答案更高效,更pythonic,但是,此答案针对的是可能不允许使用某些内置函数的学生
  • 创建一个空列表, indices
  • 使用创建循环for i in range(len(x)):,从本质上遍历索引位置列表[0, 1, 2, 3, ..., len(x)-1]
  • 在循环中,任何添加i,这里x[i]是一个比赛value,以indices
def get_indices(x: list, value: int) -> list:
    indices = list()
    for i in range(len(x)):
        if x[i] == value:
            indices.append(i)
    return indices

n = [1, 2, 3, -50, -60, 0, 6, 9, -60, -60]
print(get_indices(n, -60))

>>> [4, 8, 9]
  • 功能get_indices带有类型提示。在这种情况下,列表s n是一堆int,因此我们搜索value,也定义为int

使用while-loop.index

  • 使用.indextry-except用于错误处理,因为ValueError如果value不在列表中,则会出现a 。
def get_indices(x: list, value: int) -> list:
    indices = list()
    i = 0
    while True:
        try:
            # find an occurrence of value and update i to that index
            i = x.index(value, i)
            # add i to the list
            indices.append(i)
            # advance i by 1
            i += 1
        except ValueError as e:
            break
    return indices

print(get_indices(n, -60))
>>> [4, 8, 9]

Using a for-loop:

  • Answers with enumerate and a list comprehension are more efficient and pythonic, however, this answer is aimed at students who may not be allowed to use some of those built-in functions.
  • create an empty list, indices
  • create the loop with for i in range(len(x)):, which essentially iterates through a list of index locations [0, 1, 2, 3, ..., len(x)-1]
  • in the loop, add any i, where x[i] is a match to value, to indices
def get_indices(x: list, value: int) -> list:
    indices = list()
    for i in range(len(x)):
        if x[i] == value:
            indices.append(i)
    return indices

n = [1, 2, 3, -50, -60, 0, 6, 9, -60, -60]
print(get_indices(n, -60))

>>> [4, 8, 9]
  • The functions, get_indices, are implemented with type hints. In this case, the list, n, is a bunch of ints, therefore we search for value, also defined as an int.

Using a while-loop and .index:

  • With .index, use try-except for error handling because a ValueError will occur if value is not in the list.
def get_indices(x: list, value: int) -> list:
    indices = list()
    i = 0
    while True:
        try:
            # find an occurrence of value and update i to that index
            i = x.index(value, i)
            # add i to the list
            indices.append(i)
            # advance i by 1
            i += 1
        except ValueError as e:
            break
    return indices

print(get_indices(n, -60))
>>> [4, 8, 9]

回答 12

如果您使用的是Python 2,则可以通过以下方式实现相同的功能:

f = lambda my_list, value:filter(lambda x: my_list[x] == value, range(len(my_list)))

my_list您要获取其索引的列表在哪里,是要value搜索的值。用法:

f(some_list, some_element)

If you are using Python 2, you can achieve the same functionality with this:

f = lambda my_list, value:filter(lambda x: my_list[x] == value, range(len(my_list)))

Where my_list is the list you want to get the indexes of, and value is the value searched. Usage:

f(some_list, some_element)

回答 13

如果需要搜索某些索引之间所有元素的位置,则可以声明它们:

[i for i,x in enumerate([1,2,3,2]) if x==2 & 2<= i <=3] # -> [3]

If you need to search for all element’s positions between certain indices, you can state them:

[i for i,x in enumerate([1,2,3,2]) if x==2 & 2<= i <=3] # -> [3]

读取二进制文件并遍历每个字节

问题:读取二进制文件并遍历每个字节

在Python中,如何读取二进制文件并在该文件的每个字节上循环?

In Python, how do I read in a binary file and loop over each byte of that file?


回答 0

Python 2.4及更早版本

f = open("myfile", "rb")
try:
    byte = f.read(1)
    while byte != "":
        # Do stuff with byte.
        byte = f.read(1)
finally:
    f.close()

Python 2.5-2.7

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte != "":
        # Do stuff with byte.
        byte = f.read(1)

请注意,with语句在2.5以下的Python版本中不可用。要在v 2.5中使用它,您需要导入它:

from __future__ import with_statement

在2.6中是不需要的。

Python 3

在Python 3中,这有点不同。我们将不再以字节模式而是字节对象从流中获取原始字符,因此我们需要更改条件:

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte != b"":
        # Do stuff with byte.
        byte = f.read(1)

或如benhoyt所说,跳过不相等并利用b""评估为false 的事实。这使代码在2.6和3.x之间兼容,而无需进行任何更改。如果从字节模式更改为文本模式或相反,也可以避免更改条件。

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte:
        # Do stuff with byte.
        byte = f.read(1)

python 3.8

从现在开始,借助:=运算符,可以以更短的方式编写以上代码。

with open("myfile", "rb") as f:
    while (byte := f.read(1)):
        # Do stuff with byte.

Python 2.4 and Earlier

f = open("myfile", "rb")
try:
    byte = f.read(1)
    while byte != "":
        # Do stuff with byte.
        byte = f.read(1)
finally:
    f.close()

Python 2.5-2.7

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte != "":
        # Do stuff with byte.
        byte = f.read(1)

Note that the with statement is not available in versions of Python below 2.5. To use it in v 2.5 you’ll need to import it:

from __future__ import with_statement

In 2.6 this is not needed.

Python 3

In Python 3, it’s a bit different. We will no longer get raw characters from the stream in byte mode but byte objects, thus we need to alter the condition:

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte != b"":
        # Do stuff with byte.
        byte = f.read(1)

Or as benhoyt says, skip the not equal and take advantage of the fact that b"" evaluates to false. This makes the code compatible between 2.6 and 3.x without any changes. It would also save you from changing the condition if you go from byte mode to text or the reverse.

with open("myfile", "rb") as f:
    byte = f.read(1)
    while byte:
        # Do stuff with byte.
        byte = f.read(1)

python 3.8

From now on thanks to := operator the above code can be written in a shorter way.

with open("myfile", "rb") as f:
    while (byte := f.read(1)):
        # Do stuff with byte.

回答 1

该生成器从文件中产生字节,并分块读取文件:

def bytes_from_file(filename, chunksize=8192):
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if chunk:
                for b in chunk:
                    yield b
            else:
                break

# example:
for b in bytes_from_file('filename'):
    do_stuff_with(b)

有关迭代器生成器的信息,请参见Python文档。

This generator yields bytes from a file, reading the file in chunks:

def bytes_from_file(filename, chunksize=8192):
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if chunk:
                for b in chunk:
                    yield b
            else:
                break

# example:
for b in bytes_from_file('filename'):
    do_stuff_with(b)

See the Python documentation for information on iterators and generators.


回答 2

如果文件不是太大,则将其保存在内存中是一个问题:

with open("filename", "rb") as f:
    bytes_read = f.read()
for b in bytes_read:
    process_byte(b)

其中process_byte表示要对传入的字节执行的某些操作。

如果要一次处理一个块:

with open("filename", "rb") as f:
    bytes_read = f.read(CHUNKSIZE)
    while bytes_read:
        for b in bytes_read:
            process_byte(b)
        bytes_read = f.read(CHUNKSIZE)

with语句在Python 2.5及更高版本中可用。

If the file is not too big that holding it in memory is a problem:

with open("filename", "rb") as f:
    bytes_read = f.read()
for b in bytes_read:
    process_byte(b)

where process_byte represents some operation you want to perform on the passed-in byte.

If you want to process a chunk at a time:

with open("filename", "rb") as f:
    bytes_read = f.read(CHUNKSIZE)
    while bytes_read:
        for b in bytes_read:
            process_byte(b)
        bytes_read = f.read(CHUNKSIZE)

The with statement is available in Python 2.5 and greater.


回答 3

要读取一个文件(一次一个字节(忽略缓冲)),可以使用内置两个参数iter(callable, sentinel)的函数

with open(filename, 'rb') as file:
    for byte in iter(lambda: file.read(1), b''):
        # Do stuff with byte

它调用file.read(1)直到不返回任何内容b''(空字节串)。对于大文件,内存不会无限增长。您可以传递buffering=0open(),以禁用缓冲-它确保每次迭代仅读取一个字节(慢速)。

with-statement自动关闭文件-包括下面的代码引发异常的情况。

尽管默认情况下存在内部缓冲,但是每次处理一个字节仍然效率低下。例如,以下是该blackhole.py实用程序,它消耗了给出的所有内容:

#!/usr/bin/env python3
"""Discard all input. `cat > /dev/null` analog."""
import sys
from functools import partial
from collections import deque

chunksize = int(sys.argv[1]) if len(sys.argv) > 1 else (1 << 15)
deque(iter(partial(sys.stdin.detach().read, chunksize), b''), maxlen=0)

例:

$ dd if=/dev/zero bs=1M count=1000 | python3 blackhole.py

它处理〜1.5 Gb / s的时候chunksize == 32768我的机器,只有在〜7.5 MB / s的时候chunksize == 1。也就是说,一次读取一个字节要慢200倍。考虑到这一点,如果你可以重写你的处理同时使用多个字节,如果你需要的性能。

mmap允许您同时将文件视为bytearray和文件对象。如果您需要访问两个接口,它可以作为将整个文件加载到内存中的替代方法。特别是,您可以仅使用普通for-loop 一次遍历一个内存映射文件一个字节:

from mmap import ACCESS_READ, mmap

with open(filename, 'rb', 0) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as s:
    for byte in s: # length is equal to the current file size
        # Do stuff with byte

mmap支持切片符号。例如,从文件中从position开始mm[i:i+len]返回len字节i。Python 3.2之前不支持上下文管理器协议。mm.close()在这种情况下,您需要显式调用。使用遍历每个字节mmap比消耗更多的内存file.read(1),但mmap速度要快一个数量级。

To read a file — one byte at a time (ignoring the buffering) — you could use the two-argument iter(callable, sentinel) built-in function:

with open(filename, 'rb') as file:
    for byte in iter(lambda: file.read(1), b''):
        # Do stuff with byte

It calls file.read(1) until it returns nothing b'' (empty bytestring). The memory doesn’t grow unlimited for large files. You could pass buffering=0 to open(), to disable the buffering — it guarantees that only one byte is read per iteration (slow).

with-statement closes the file automatically — including the case when the code underneath raises an exception.

Despite the presence of internal buffering by default, it is still inefficient to process one byte at a time. For example, here’s the blackhole.py utility that eats everything it is given:

#!/usr/bin/env python3
"""Discard all input. `cat > /dev/null` analog."""
import sys
from functools import partial
from collections import deque

chunksize = int(sys.argv[1]) if len(sys.argv) > 1 else (1 << 15)
deque(iter(partial(sys.stdin.detach().read, chunksize), b''), maxlen=0)

Example:

$ dd if=/dev/zero bs=1M count=1000 | python3 blackhole.py

It processes ~1.5 GB/s when chunksize == 32768 on my machine and only ~7.5 MB/s when chunksize == 1. That is, it is 200 times slower to read one byte at a time. Take it into account if you can rewrite your processing to use more than one byte at a time and if you need performance.

mmap allows you to treat a file as a bytearray and a file object simultaneously. It can serve as an alternative to loading the whole file in memory if you need access both interfaces. In particular, you can iterate one byte at a time over a memory-mapped file just using a plain for-loop:

from mmap import ACCESS_READ, mmap

with open(filename, 'rb', 0) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as s:
    for byte in s: # length is equal to the current file size
        # Do stuff with byte

mmap supports the slice notation. For example, mm[i:i+len] returns len bytes from the file starting at position i. The context manager protocol is not supported before Python 3.2; you need to call mm.close() explicitly in this case. Iterating over each byte using mmap consumes more memory than file.read(1), but mmap is an order of magnitude faster.


回答 4

用Python读取二进制文件并遍历每个字节

Python 3.5中的新功能是 pathlib模块,它具有一种便捷的方法,专门用于将文件读取为字节,从而允许我们遍历字节。我认为这是一个不错的回答(如果又快又脏):

import pathlib

for byte in pathlib.Path(path).read_bytes():
    print(byte)

有趣的是,这是唯一要提及的答案pathlib

在Python 2中,您可能会这样做(正如Vinay Sajip也建议的那样):

with open(path, 'b') as file:
    for byte in file.read():
        print(byte)

如果文件太大而无法在内存中进行迭代,则可以使用iter带有callable, sentinel签名的函数(Python 2版本)来惯用地对其进行分块:

with open(path, 'b') as file:
    callable = lambda: file.read(1024)
    sentinel = bytes() # or b''
    for chunk in iter(callable, sentinel): 
        for byte in chunk:
            print(byte)

(其他几个答案都提到了这一点,但很少有人提供合理的读取大小。)

大文件或缓冲/交互式读取的最佳实践

让我们创建一个函数来做到这一点,包括对Python 3.5+标准库的惯用用法:

from pathlib import Path
from functools import partial
from io import DEFAULT_BUFFER_SIZE

def file_byte_iterator(path):
    """given a path, return an iterator over the file
    that lazily loads the file
    """
    path = Path(path)
    with path.open('rb') as file:
        reader = partial(file.read1, DEFAULT_BUFFER_SIZE)
        file_iterator = iter(reader, bytes())
        for chunk in file_iterator:
            yield from chunk

请注意,我们使用file.read1file.read阻塞,直到获得所有请求的字节或EOFfile.read1让我们避免阻塞,因此它可以更快地返回。没有其他答案也提到这一点。

最佳实践演示:

让我们制作一个具有兆字节(实际上是兆字节)的伪随机数据的文件:

import random
import pathlib
path = 'pseudorandom_bytes'
pathobj = pathlib.Path(path)

pathobj.write_bytes(
  bytes(random.randint(0, 255) for _ in range(2**20)))

现在让我们对其进行迭代并在内存中实现它:

>>> l = list(file_byte_iterator(path))
>>> len(l)
1048576

我们可以检查数据的任何部分,例如,最后100个字节和前100个字节:

>>> l[-100:]
[208, 5, 156, 186, 58, 107, 24, 12, 75, 15, 1, 252, 216, 183, 235, 6, 136, 50, 222, 218, 7, 65, 234, 129, 240, 195, 165, 215, 245, 201, 222, 95, 87, 71, 232, 235, 36, 224, 190, 185, 12, 40, 131, 54, 79, 93, 210, 6, 154, 184, 82, 222, 80, 141, 117, 110, 254, 82, 29, 166, 91, 42, 232, 72, 231, 235, 33, 180, 238, 29, 61, 250, 38, 86, 120, 38, 49, 141, 17, 190, 191, 107, 95, 223, 222, 162, 116, 153, 232, 85, 100, 97, 41, 61, 219, 233, 237, 55, 246, 181]
>>> l[:100]
[28, 172, 79, 126, 36, 99, 103, 191, 146, 225, 24, 48, 113, 187, 48, 185, 31, 142, 216, 187, 27, 146, 215, 61, 111, 218, 171, 4, 160, 250, 110, 51, 128, 106, 3, 10, 116, 123, 128, 31, 73, 152, 58, 49, 184, 223, 17, 176, 166, 195, 6, 35, 206, 206, 39, 231, 89, 249, 21, 112, 168, 4, 88, 169, 215, 132, 255, 168, 129, 127, 60, 252, 244, 160, 80, 155, 246, 147, 234, 227, 157, 137, 101, 84, 115, 103, 77, 44, 84, 134, 140, 77, 224, 176, 242, 254, 171, 115, 193, 29]

不要逐行迭代二进制文件

不要执行以下操作-这会拉出任意大小的块,直到到达换行符为止-当块太小且可能也太大时,太慢:

    with open(path, 'rb') as file:
        for chunk in file: # text newline iteration - not for bytes
            yield from chunk

上面的内容仅适用于语义上人类可读的文本文件(例如纯文本,代码,标记,降价等……本质上是ascii,utf,拉丁语等……已编码),您应该在不带'b'标志的情况下打开它们。

Reading binary file in Python and looping over each byte

New in Python 3.5 is the pathlib module, which has a convenience method specifically to read in a file as bytes, allowing us to iterate over the bytes. I consider this a decent (if quick and dirty) answer:

import pathlib

for byte in pathlib.Path(path).read_bytes():
    print(byte)

Interesting that this is the only answer to mention pathlib.

In Python 2, you probably would do this (as Vinay Sajip also suggests):

with open(path, 'b') as file:
    for byte in file.read():
        print(byte)

In the case that the file may be too large to iterate over in-memory, you would chunk it, idiomatically, using the iter function with the callable, sentinel signature – the Python 2 version:

with open(path, 'b') as file:
    callable = lambda: file.read(1024)
    sentinel = bytes() # or b''
    for chunk in iter(callable, sentinel): 
        for byte in chunk:
            print(byte)

(Several other answers mention this, but few offer a sensible read size.)

Best practice for large files or buffered/interactive reading

Let’s create a function to do this, including idiomatic uses of the standard library for Python 3.5+:

from pathlib import Path
from functools import partial
from io import DEFAULT_BUFFER_SIZE

def file_byte_iterator(path):
    """given a path, return an iterator over the file
    that lazily loads the file
    """
    path = Path(path)
    with path.open('rb') as file:
        reader = partial(file.read1, DEFAULT_BUFFER_SIZE)
        file_iterator = iter(reader, bytes())
        for chunk in file_iterator:
            yield from chunk

Note that we use file.read1. file.read blocks until it gets all the bytes requested of it or EOF. file.read1 allows us to avoid blocking, and it can return more quickly because of this. No other answers mention this as well.

Demonstration of best practice usage:

Let’s make a file with a megabyte (actually mebibyte) of pseudorandom data:

import random
import pathlib
path = 'pseudorandom_bytes'
pathobj = pathlib.Path(path)

pathobj.write_bytes(
  bytes(random.randint(0, 255) for _ in range(2**20)))

Now let’s iterate over it and materialize it in memory:

>>> l = list(file_byte_iterator(path))
>>> len(l)
1048576

We can inspect any part of the data, for example, the last 100 and first 100 bytes:

>>> l[-100:]
[208, 5, 156, 186, 58, 107, 24, 12, 75, 15, 1, 252, 216, 183, 235, 6, 136, 50, 222, 218, 7, 65, 234, 129, 240, 195, 165, 215, 245, 201, 222, 95, 87, 71, 232, 235, 36, 224, 190, 185, 12, 40, 131, 54, 79, 93, 210, 6, 154, 184, 82, 222, 80, 141, 117, 110, 254, 82, 29, 166, 91, 42, 232, 72, 231, 235, 33, 180, 238, 29, 61, 250, 38, 86, 120, 38, 49, 141, 17, 190, 191, 107, 95, 223, 222, 162, 116, 153, 232, 85, 100, 97, 41, 61, 219, 233, 237, 55, 246, 181]
>>> l[:100]
[28, 172, 79, 126, 36, 99, 103, 191, 146, 225, 24, 48, 113, 187, 48, 185, 31, 142, 216, 187, 27, 146, 215, 61, 111, 218, 171, 4, 160, 250, 110, 51, 128, 106, 3, 10, 116, 123, 128, 31, 73, 152, 58, 49, 184, 223, 17, 176, 166, 195, 6, 35, 206, 206, 39, 231, 89, 249, 21, 112, 168, 4, 88, 169, 215, 132, 255, 168, 129, 127, 60, 252, 244, 160, 80, 155, 246, 147, 234, 227, 157, 137, 101, 84, 115, 103, 77, 44, 84, 134, 140, 77, 224, 176, 242, 254, 171, 115, 193, 29]

Don’t iterate by lines for binary files

Don’t do the following – this pulls a chunk of arbitrary size until it gets to a newline character – too slow when the chunks are too small, and possibly too large as well:

    with open(path, 'rb') as file:
        for chunk in file: # text newline iteration - not for bytes
            yield from chunk

The above is only good for what are semantically human readable text files (like plain text, code, markup, markdown etc… essentially anything ascii, utf, latin, etc… encoded) that you should open without the 'b' flag.


回答 5

总结一下chrispy,Skurmedel,Ben Hoyt和Peter Hansen的所有要点,这将是一次处理一个字节的二进制文件的最佳解决方案:

with open("myfile", "rb") as f:
    while True:
        byte = f.read(1)
        if not byte:
            break
        do_stuff_with(ord(byte))

对于python 2.6及更高版本,因为:

  • 内部python缓冲区-无需读取块
  • 干式原理-不要重复读取行
  • with语句可确保关闭文件干净
  • 如果没有更多的字节(不是字节为零),则“ byte”的计算结果为false

或使用JF Sebastians解决方案提高速度

from functools import partial

with open(filename, 'rb') as file:
    for byte in iter(partial(file.read, 1), b''):
        # Do stuff with byte

或者,如果您希望将其用作生成器功能(如codeape所示):

def bytes_from_file(filename):
    with open(filename, "rb") as f:
        while True:
            byte = f.read(1)
            if not byte:
                break
            yield(ord(byte))

# example:
for b in bytes_from_file('filename'):
    do_stuff_with(b)

To sum up all the brilliant points of chrispy, Skurmedel, Ben Hoyt and Peter Hansen, this would be the optimal solution for processing a binary file one byte at a time:

with open("myfile", "rb") as f:
    while True:
        byte = f.read(1)
        if not byte:
            break
        do_stuff_with(ord(byte))

For python versions 2.6 and above, because:

  • python buffers internally – no need to read chunks
  • DRY principle – do not repeat the read line
  • with statement ensures a clean file close
  • ‘byte’ evaluates to false when there are no more bytes (not when a byte is zero)

Or use J. F. Sebastians solution for improved speed

from functools import partial

with open(filename, 'rb') as file:
    for byte in iter(partial(file.read, 1), b''):
        # Do stuff with byte

Or if you want it as a generator function like demonstrated by codeape:

def bytes_from_file(filename):
    with open(filename, "rb") as f:
        while True:
            byte = f.read(1)
            if not byte:
                break
            yield(ord(byte))

# example:
for b in bytes_from_file('filename'):
    do_stuff_with(b)

回答 6

Python 3,一次读取所有文件:

with open("filename", "rb") as binary_file:
    # Read the whole file at once
    data = binary_file.read()
    print(data)

您可以使用data变量迭代任何对象。

Python 3, read all of the file at once:

with open("filename", "rb") as binary_file:
    # Read the whole file at once
    data = binary_file.read()
    print(data)

You can iterate whatever you want using data variable.


回答 7

经过上述所有尝试并使用@Aaron Hall的回答后,在运行Window 10、8 Gb RAM和32位Python 3.5的计算机上,我收到了约90 Mb文件的内存错误。我被同事推荐使用numpy改用它,这样效果会很好。

到目前为止,读取整个二进制文件(我已经测试过)的最快方法是:

import numpy as np

file = "binary_file.bin"
data = np.fromfile(file, 'u1')

参考

到目前为止,速度比任何其他方法都要快。希望它能对某人有所帮助!

After trying all the above and using the answer from @Aaron Hall, I was getting memory errors for a ~90 Mb file on a computer running Window 10, 8 Gb RAM and Python 3.5 32-bit. I was recommended by a colleague to use numpy instead and it works wonders.

By far, the fastest to read an entire binary file (that I have tested) is:

import numpy as np

file = "binary_file.bin"
data = np.fromfile(file, 'u1')

Reference

Multitudes faster than any other methods so far. Hope it helps someone!


回答 8

如果要读取大量二进制数据,则可能需要考虑struct模块。它被记录为“在C和Python类型之间转换”,但是字节当然是字节,并且是否将它们创建为C类型并不重要。例如,如果您的二进制数据包含两个2个字节的整数和一个4个字节的整数,则可以按以下方式读取它们(示例取自struct文档):

>>> struct.unpack('hhl', b'\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)

与显式遍历文件内容相比,您可能会发现这更方便,更快或两者兼而有之。

If you have a lot of binary data to read, you might want to consider the struct module. It is documented as converting “between C and Python types”, but of course, bytes are bytes, and whether those were created as C types does not matter. For example, if your binary data contains two 2-byte integers and one 4-byte integer, you can read them as follows (example taken from struct documentation):

>>> struct.unpack('hhl', b'\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)

You might find this more convenient, faster, or both, than explicitly looping over the content of a file.


回答 9

这篇文章本身不是该问题的直接答案。相反,它是一个数据驱动的可扩展基准,可用于比较已发布到此问题的许多答案(以及在以后的更现代的Python版本中使用的新功能的变体),因此应该有助于确定哪个具有最佳性能。

在某些情况下,我已经修改了参考答案中的代码,以使其与基准框架兼容。

首先,以下是当前最新版本的Python 2和3的结果:

Fastest to slowest execution speeds with 32-bit Python 2.7.16
  numpy version 1.16.5
  Test file size: 1,024 KiB
  100 executions, best of 3 repetitions

1                  Tcll (array.array) :   3.8943 secs, rel speed   1.00x,   0.00% slower (262.95 KiB/sec)
2  Vinay Sajip (read all into memory) :   4.1164 secs, rel speed   1.06x,   5.71% slower (248.76 KiB/sec)
3            codeape + iter + partial :   4.1616 secs, rel speed   1.07x,   6.87% slower (246.06 KiB/sec)
4                             codeape :   4.1889 secs, rel speed   1.08x,   7.57% slower (244.46 KiB/sec)
5               Vinay Sajip (chunked) :   4.1977 secs, rel speed   1.08x,   7.79% slower (243.94 KiB/sec)
6           Aaron Hall (Py 2 version) :   4.2417 secs, rel speed   1.09x,   8.92% slower (241.41 KiB/sec)
7                     gerrit (struct) :   4.2561 secs, rel speed   1.09x,   9.29% slower (240.59 KiB/sec)
8                     Rick M. (numpy) :   8.1398 secs, rel speed   2.09x, 109.02% slower (125.80 KiB/sec)
9                           Skurmedel :  31.3264 secs, rel speed   8.04x, 704.42% slower ( 32.69 KiB/sec)

Benchmark runtime (min:sec) - 03:26

Fastest to slowest execution speeds with 32-bit Python 3.8.0
  numpy version 1.17.4
  Test file size: 1,024 KiB
  100 executions, best of 3 repetitions

1  Vinay Sajip + "yield from" + "walrus operator" :   3.5235 secs, rel speed   1.00x,   0.00% slower (290.62 KiB/sec)
2                       Aaron Hall + "yield from" :   3.5284 secs, rel speed   1.00x,   0.14% slower (290.22 KiB/sec)
3         codeape + iter + partial + "yield from" :   3.5303 secs, rel speed   1.00x,   0.19% slower (290.06 KiB/sec)
4                      Vinay Sajip + "yield from" :   3.5312 secs, rel speed   1.00x,   0.22% slower (289.99 KiB/sec)
5      codeape + "yield from" + "walrus operator" :   3.5370 secs, rel speed   1.00x,   0.38% slower (289.51 KiB/sec)
6                          codeape + "yield from" :   3.5390 secs, rel speed   1.00x,   0.44% slower (289.35 KiB/sec)
7                                      jfs (mmap) :   4.0612 secs, rel speed   1.15x,  15.26% slower (252.14 KiB/sec)
8              Vinay Sajip (read all into memory) :   4.5948 secs, rel speed   1.30x,  30.40% slower (222.86 KiB/sec)
9                        codeape + iter + partial :   4.5994 secs, rel speed   1.31x,  30.54% slower (222.64 KiB/sec)
10                                        codeape :   4.5995 secs, rel speed   1.31x,  30.54% slower (222.63 KiB/sec)
11                          Vinay Sajip (chunked) :   4.6110 secs, rel speed   1.31x,  30.87% slower (222.08 KiB/sec)
12                      Aaron Hall (Py 2 version) :   4.6292 secs, rel speed   1.31x,  31.38% slower (221.20 KiB/sec)
13                             Tcll (array.array) :   4.8627 secs, rel speed   1.38x,  38.01% slower (210.58 KiB/sec)
14                                gerrit (struct) :   5.0816 secs, rel speed   1.44x,  44.22% slower (201.51 KiB/sec)
15                 Rick M. (numpy) + "yield from" :  11.8084 secs, rel speed   3.35x, 235.13% slower ( 86.72 KiB/sec)
16                                      Skurmedel :  11.8806 secs, rel speed   3.37x, 237.18% slower ( 86.19 KiB/sec)
17                                Rick M. (numpy) :  13.3860 secs, rel speed   3.80x, 279.91% slower ( 76.50 KiB/sec)

Benchmark runtime (min:sec) - 04:47

我还使用更大的10 MiB测试文件(运行了将近一个小时)运行了它,并获得了与上面显示的结果相当的性能结果。

这是用于进行基准测试的代码:

from __future__ import print_function
import array
import atexit
from collections import deque, namedtuple
import io
from mmap import ACCESS_READ, mmap
import numpy as np
from operator import attrgetter
import os
import random
import struct
import sys
import tempfile
from textwrap import dedent
import time
import timeit
import traceback

try:
    xrange
except NameError:  # Python 3
    xrange = range


class KiB(int):
    """ KibiBytes - multiples of the byte units for quantities of information. """
    def __new__(self, value=0):
        return 1024*value


BIG_TEST_FILE = 1  # MiBs or 0 for a small file.
SML_TEST_FILE = KiB(64)
EXECUTIONS = 100  # Number of times each "algorithm" is executed per timing run.
TIMINGS = 3  # Number of timing runs.
CHUNK_SIZE = KiB(8)
if BIG_TEST_FILE:
    FILE_SIZE = KiB(1024) * BIG_TEST_FILE
else:
    FILE_SIZE = SML_TEST_FILE  # For quicker testing.

# Common setup for all algorithms -- prefixed to each algorithm's setup.
COMMON_SETUP = dedent("""
    # Make accessible in algorithms.
    from __main__ import array, deque, get_buffer_size, mmap, np, struct
    from __main__ import ACCESS_READ, CHUNK_SIZE, FILE_SIZE, TEMP_FILENAME
    from functools import partial
    try:
        xrange
    except NameError:  # Python 3
        xrange = range
""")


def get_buffer_size(path):
    """ Determine optimal buffer size for reading files. """
    st = os.stat(path)
    try:
        bufsize = st.st_blksize # Available on some Unix systems (like Linux)
    except AttributeError:
        bufsize = io.DEFAULT_BUFFER_SIZE
    return bufsize

# Utility primarily for use when embedding additional algorithms into benchmark.
VERIFY_NUM_READ = """
    # Verify generator reads correct number of bytes (assumes values are correct).
    bytes_read = sum(1 for _ in file_byte_iterator(TEMP_FILENAME))
    assert bytes_read == FILE_SIZE, \
           'Wrong number of bytes generated: got {:,} instead of {:,}'.format(
                bytes_read, FILE_SIZE)
"""

TIMING = namedtuple('TIMING', 'label, exec_time')

class Algorithm(namedtuple('CodeFragments', 'setup, test')):

    # Default timeit "stmt" code fragment.
    _TEST = """
        #for b in file_byte_iterator(TEMP_FILENAME):  # Loop over every byte.
        #    pass  # Do stuff with byte...
        deque(file_byte_iterator(TEMP_FILENAME), maxlen=0)  # Data sink.
    """

    # Must overload __new__ because (named)tuples are immutable.
    def __new__(cls, setup, test=None):
        """ Dedent (unindent) code fragment string arguments.
        Args:
          `setup` -- Code fragment that defines things used by `test` code.
                     In this case it should define a generator function named
                     `file_byte_iterator()` that will be passed that name of a test file
                     of binary data. This code is not timed.
          `test` -- Code fragment that uses things defined in `setup` code.
                    Defaults to _TEST. This is the code that's timed.
        """
        test =  cls._TEST if test is None else test  # Use default unless one is provided.

        # Uncomment to replace all performance tests with one that verifies the correct
        # number of bytes values are being generated by the file_byte_iterator function.
        #test = VERIFY_NUM_READ

        return tuple.__new__(cls, (dedent(setup), dedent(test)))


algorithms = {

    'Aaron Hall (Py 2 version)': Algorithm("""
        def file_byte_iterator(path):
            with open(path, "rb") as file:
                callable = partial(file.read, 1024)
                sentinel = bytes() # or b''
                for chunk in iter(callable, sentinel):
                    for byte in chunk:
                        yield byte
    """),

    "codeape": Algorithm("""
        def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
            with open(filename, "rb") as f:
                while True:
                    chunk = f.read(chunksize)
                    if chunk:
                        for b in chunk:
                            yield b
                    else:
                        break
    """),

    "codeape + iter + partial": Algorithm("""
        def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
            with open(filename, "rb") as f:
                for chunk in iter(partial(f.read, chunksize), b''):
                    for b in chunk:
                        yield b
    """),

    "gerrit (struct)": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                fmt = '{}B'.format(FILE_SIZE)  # Reads entire file at once.
                for b in struct.unpack(fmt, f.read()):
                    yield b
    """),

    'Rick M. (numpy)': Algorithm("""
        def file_byte_iterator(filename):
            for byte in np.fromfile(filename, 'u1'):
                yield byte
    """),

    "Skurmedel": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                byte = f.read(1)
                while byte:
                    yield byte
                    byte = f.read(1)
    """),

    "Tcll (array.array)": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                arr = array.array('B')
                arr.fromfile(f, FILE_SIZE)  # Reads entire file at once.
                for b in arr:
                    yield b
    """),

    "Vinay Sajip (read all into memory)": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                bytes_read = f.read()  # Reads entire file at once.
            for b in bytes_read:
                yield b
    """),

    "Vinay Sajip (chunked)": Algorithm("""
        def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
            with open(filename, "rb") as f:
                chunk = f.read(chunksize)
                while chunk:
                    for b in chunk:
                        yield b
                    chunk = f.read(chunksize)
    """),

}  # End algorithms

#
# Versions of algorithms that will only work in certain releases (or better) of Python.
#
if sys.version_info >= (3, 3):
    algorithms.update({

        'codeape + iter + partial + "yield from"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    for chunk in iter(partial(f.read, chunksize), b''):
                        yield from chunk
        """),

        'codeape + "yield from"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    while True:
                        chunk = f.read(chunksize)
                        if chunk:
                            yield from chunk
                        else:
                            break
        """),

        "jfs (mmap)": Algorithm("""
            def file_byte_iterator(filename):
                with open(filename, "rb") as f, \
                     mmap(f.fileno(), 0, access=ACCESS_READ) as s:
                    yield from s
        """),

        'Rick M. (numpy) + "yield from"': Algorithm("""
            def file_byte_iterator(filename):
            #    data = np.fromfile(filename, 'u1')
                yield from np.fromfile(filename, 'u1')
        """),

        'Vinay Sajip + "yield from"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    chunk = f.read(chunksize)
                    while chunk:
                        yield from chunk  # Added in Py 3.3
                        chunk = f.read(chunksize)
        """),

    })  # End Python 3.3 update.

if sys.version_info >= (3, 5):
    algorithms.update({

        'Aaron Hall + "yield from"': Algorithm("""
            from pathlib import Path

            def file_byte_iterator(path):
                ''' Given a path, return an iterator over the file
                    that lazily loads the file.
                '''
                path = Path(path)
                bufsize = get_buffer_size(path)

                with path.open('rb') as file:
                    reader = partial(file.read1, bufsize)
                    for chunk in iter(reader, bytes()):
                        yield from chunk
        """),

    })  # End Python 3.5 update.

if sys.version_info >= (3, 8, 0):
    algorithms.update({

        'Vinay Sajip + "yield from" + "walrus operator"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    while chunk := f.read(chunksize):
                        yield from chunk  # Added in Py 3.3
        """),

        'codeape + "yield from" + "walrus operator"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    while chunk := f.read(chunksize):
                        yield from chunk
        """),

    })  # End Python 3.8.0 update.update.


#### Main ####

def main():
    global TEMP_FILENAME

    def cleanup():
        """ Clean up after testing is completed. """
        try:
            os.remove(TEMP_FILENAME)  # Delete the temporary file.
        except Exception:
            pass

    atexit.register(cleanup)

    # Create a named temporary binary file of pseudo-random bytes for testing.
    fd, TEMP_FILENAME = tempfile.mkstemp('.bin')
    with os.fdopen(fd, 'wb') as file:
         os.write(fd, bytearray(random.randrange(256) for _ in range(FILE_SIZE)))

    # Execute and time each algorithm, gather results.
    start_time = time.time()  # To determine how long testing itself takes.

    timings = []
    for label in algorithms:
        try:
            timing = TIMING(label,
                            min(timeit.repeat(algorithms[label].test,
                                              setup=COMMON_SETUP + algorithms[label].setup,
                                              repeat=TIMINGS, number=EXECUTIONS)))
        except Exception as exc:
            print('{} occurred timing the algorithm: "{}"\n  {}'.format(
                    type(exc).__name__, label, exc))
            traceback.print_exc(file=sys.stdout)  # Redirect to stdout.
            sys.exit(1)
        timings.append(timing)

    # Report results.
    print('Fastest to slowest execution speeds with {}-bit Python {}.{}.{}'.format(
            64 if sys.maxsize > 2**32 else 32, *sys.version_info[:3]))
    print('  numpy version {}'.format(np.version.full_version))
    print('  Test file size: {:,} KiB'.format(FILE_SIZE // KiB(1)))
    print('  {:,d} executions, best of {:d} repetitions'.format(EXECUTIONS, TIMINGS))
    print()

    longest = max(len(timing.label) for timing in timings)  # Len of longest identifier.
    ranked = sorted(timings, key=attrgetter('exec_time')) # Sort so fastest is first.
    fastest = ranked[0].exec_time
    for rank, timing in enumerate(ranked, 1):
        print('{:<2d} {:>{width}} : {:8.4f} secs, rel speed {:6.2f}x, {:6.2f}% slower '
              '({:6.2f} KiB/sec)'.format(
                    rank,
                    timing.label, timing.exec_time, round(timing.exec_time/fastest, 2),
                    round((timing.exec_time/fastest - 1) * 100, 2),
                    (FILE_SIZE/timing.exec_time) / KiB(1),  # per sec.
                    width=longest))
    print()
    mins, secs = divmod(time.time()-start_time, 60)
    print('Benchmark runtime (min:sec) - {:02d}:{:02d}'.format(int(mins),
                                                               int(round(secs))))

main()

This post itself is not a direct answer to the question. What it is instead is a data-driven extensible benchmark that can be used to compare many of the answers (and variations of utilizing new features added in later, more modern, versions of Python) that have been posted to this question — and should therefore be helpful in determining which has the best performance.

In a few cases I’ve modified the code in the referenced answer to make it compatible with the benchmark framework.

First, here are the results for what currently are the latest versions of Python 2 & 3:

Fastest to slowest execution speeds with 32-bit Python 2.7.16
  numpy version 1.16.5
  Test file size: 1,024 KiB
  100 executions, best of 3 repetitions

1                  Tcll (array.array) :   3.8943 secs, rel speed   1.00x,   0.00% slower (262.95 KiB/sec)
2  Vinay Sajip (read all into memory) :   4.1164 secs, rel speed   1.06x,   5.71% slower (248.76 KiB/sec)
3            codeape + iter + partial :   4.1616 secs, rel speed   1.07x,   6.87% slower (246.06 KiB/sec)
4                             codeape :   4.1889 secs, rel speed   1.08x,   7.57% slower (244.46 KiB/sec)
5               Vinay Sajip (chunked) :   4.1977 secs, rel speed   1.08x,   7.79% slower (243.94 KiB/sec)
6           Aaron Hall (Py 2 version) :   4.2417 secs, rel speed   1.09x,   8.92% slower (241.41 KiB/sec)
7                     gerrit (struct) :   4.2561 secs, rel speed   1.09x,   9.29% slower (240.59 KiB/sec)
8                     Rick M. (numpy) :   8.1398 secs, rel speed   2.09x, 109.02% slower (125.80 KiB/sec)
9                           Skurmedel :  31.3264 secs, rel speed   8.04x, 704.42% slower ( 32.69 KiB/sec)

Benchmark runtime (min:sec) - 03:26

Fastest to slowest execution speeds with 32-bit Python 3.8.0
  numpy version 1.17.4
  Test file size: 1,024 KiB
  100 executions, best of 3 repetitions

1  Vinay Sajip + "yield from" + "walrus operator" :   3.5235 secs, rel speed   1.00x,   0.00% slower (290.62 KiB/sec)
2                       Aaron Hall + "yield from" :   3.5284 secs, rel speed   1.00x,   0.14% slower (290.22 KiB/sec)
3         codeape + iter + partial + "yield from" :   3.5303 secs, rel speed   1.00x,   0.19% slower (290.06 KiB/sec)
4                      Vinay Sajip + "yield from" :   3.5312 secs, rel speed   1.00x,   0.22% slower (289.99 KiB/sec)
5      codeape + "yield from" + "walrus operator" :   3.5370 secs, rel speed   1.00x,   0.38% slower (289.51 KiB/sec)
6                          codeape + "yield from" :   3.5390 secs, rel speed   1.00x,   0.44% slower (289.35 KiB/sec)
7                                      jfs (mmap) :   4.0612 secs, rel speed   1.15x,  15.26% slower (252.14 KiB/sec)
8              Vinay Sajip (read all into memory) :   4.5948 secs, rel speed   1.30x,  30.40% slower (222.86 KiB/sec)
9                        codeape + iter + partial :   4.5994 secs, rel speed   1.31x,  30.54% slower (222.64 KiB/sec)
10                                        codeape :   4.5995 secs, rel speed   1.31x,  30.54% slower (222.63 KiB/sec)
11                          Vinay Sajip (chunked) :   4.6110 secs, rel speed   1.31x,  30.87% slower (222.08 KiB/sec)
12                      Aaron Hall (Py 2 version) :   4.6292 secs, rel speed   1.31x,  31.38% slower (221.20 KiB/sec)
13                             Tcll (array.array) :   4.8627 secs, rel speed   1.38x,  38.01% slower (210.58 KiB/sec)
14                                gerrit (struct) :   5.0816 secs, rel speed   1.44x,  44.22% slower (201.51 KiB/sec)
15                 Rick M. (numpy) + "yield from" :  11.8084 secs, rel speed   3.35x, 235.13% slower ( 86.72 KiB/sec)
16                                      Skurmedel :  11.8806 secs, rel speed   3.37x, 237.18% slower ( 86.19 KiB/sec)
17                                Rick M. (numpy) :  13.3860 secs, rel speed   3.80x, 279.91% slower ( 76.50 KiB/sec)

Benchmark runtime (min:sec) - 04:47

I also ran it with a much larger 10 MiB test file (which took nearly an hour to run) and got performance results which were comparable to those shown above.

Here’s the code used to do the benchmarking:

from __future__ import print_function
import array
import atexit
from collections import deque, namedtuple
import io
from mmap import ACCESS_READ, mmap
import numpy as np
from operator import attrgetter
import os
import random
import struct
import sys
import tempfile
from textwrap import dedent
import time
import timeit
import traceback

try:
    xrange
except NameError:  # Python 3
    xrange = range


class KiB(int):
    """ KibiBytes - multiples of the byte units for quantities of information. """
    def __new__(self, value=0):
        return 1024*value


BIG_TEST_FILE = 1  # MiBs or 0 for a small file.
SML_TEST_FILE = KiB(64)
EXECUTIONS = 100  # Number of times each "algorithm" is executed per timing run.
TIMINGS = 3  # Number of timing runs.
CHUNK_SIZE = KiB(8)
if BIG_TEST_FILE:
    FILE_SIZE = KiB(1024) * BIG_TEST_FILE
else:
    FILE_SIZE = SML_TEST_FILE  # For quicker testing.

# Common setup for all algorithms -- prefixed to each algorithm's setup.
COMMON_SETUP = dedent("""
    # Make accessible in algorithms.
    from __main__ import array, deque, get_buffer_size, mmap, np, struct
    from __main__ import ACCESS_READ, CHUNK_SIZE, FILE_SIZE, TEMP_FILENAME
    from functools import partial
    try:
        xrange
    except NameError:  # Python 3
        xrange = range
""")


def get_buffer_size(path):
    """ Determine optimal buffer size for reading files. """
    st = os.stat(path)
    try:
        bufsize = st.st_blksize # Available on some Unix systems (like Linux)
    except AttributeError:
        bufsize = io.DEFAULT_BUFFER_SIZE
    return bufsize

# Utility primarily for use when embedding additional algorithms into benchmark.
VERIFY_NUM_READ = """
    # Verify generator reads correct number of bytes (assumes values are correct).
    bytes_read = sum(1 for _ in file_byte_iterator(TEMP_FILENAME))
    assert bytes_read == FILE_SIZE, \
           'Wrong number of bytes generated: got {:,} instead of {:,}'.format(
                bytes_read, FILE_SIZE)
"""

TIMING = namedtuple('TIMING', 'label, exec_time')

class Algorithm(namedtuple('CodeFragments', 'setup, test')):

    # Default timeit "stmt" code fragment.
    _TEST = """
        #for b in file_byte_iterator(TEMP_FILENAME):  # Loop over every byte.
        #    pass  # Do stuff with byte...
        deque(file_byte_iterator(TEMP_FILENAME), maxlen=0)  # Data sink.
    """

    # Must overload __new__ because (named)tuples are immutable.
    def __new__(cls, setup, test=None):
        """ Dedent (unindent) code fragment string arguments.
        Args:
          `setup` -- Code fragment that defines things used by `test` code.
                     In this case it should define a generator function named
                     `file_byte_iterator()` that will be passed that name of a test file
                     of binary data. This code is not timed.
          `test` -- Code fragment that uses things defined in `setup` code.
                    Defaults to _TEST. This is the code that's timed.
        """
        test =  cls._TEST if test is None else test  # Use default unless one is provided.

        # Uncomment to replace all performance tests with one that verifies the correct
        # number of bytes values are being generated by the file_byte_iterator function.
        #test = VERIFY_NUM_READ

        return tuple.__new__(cls, (dedent(setup), dedent(test)))


algorithms = {

    'Aaron Hall (Py 2 version)': Algorithm("""
        def file_byte_iterator(path):
            with open(path, "rb") as file:
                callable = partial(file.read, 1024)
                sentinel = bytes() # or b''
                for chunk in iter(callable, sentinel):
                    for byte in chunk:
                        yield byte
    """),

    "codeape": Algorithm("""
        def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
            with open(filename, "rb") as f:
                while True:
                    chunk = f.read(chunksize)
                    if chunk:
                        for b in chunk:
                            yield b
                    else:
                        break
    """),

    "codeape + iter + partial": Algorithm("""
        def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
            with open(filename, "rb") as f:
                for chunk in iter(partial(f.read, chunksize), b''):
                    for b in chunk:
                        yield b
    """),

    "gerrit (struct)": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                fmt = '{}B'.format(FILE_SIZE)  # Reads entire file at once.
                for b in struct.unpack(fmt, f.read()):
                    yield b
    """),

    'Rick M. (numpy)': Algorithm("""
        def file_byte_iterator(filename):
            for byte in np.fromfile(filename, 'u1'):
                yield byte
    """),

    "Skurmedel": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                byte = f.read(1)
                while byte:
                    yield byte
                    byte = f.read(1)
    """),

    "Tcll (array.array)": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                arr = array.array('B')
                arr.fromfile(f, FILE_SIZE)  # Reads entire file at once.
                for b in arr:
                    yield b
    """),

    "Vinay Sajip (read all into memory)": Algorithm("""
        def file_byte_iterator(filename):
            with open(filename, "rb") as f:
                bytes_read = f.read()  # Reads entire file at once.
            for b in bytes_read:
                yield b
    """),

    "Vinay Sajip (chunked)": Algorithm("""
        def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
            with open(filename, "rb") as f:
                chunk = f.read(chunksize)
                while chunk:
                    for b in chunk:
                        yield b
                    chunk = f.read(chunksize)
    """),

}  # End algorithms

#
# Versions of algorithms that will only work in certain releases (or better) of Python.
#
if sys.version_info >= (3, 3):
    algorithms.update({

        'codeape + iter + partial + "yield from"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    for chunk in iter(partial(f.read, chunksize), b''):
                        yield from chunk
        """),

        'codeape + "yield from"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    while True:
                        chunk = f.read(chunksize)
                        if chunk:
                            yield from chunk
                        else:
                            break
        """),

        "jfs (mmap)": Algorithm("""
            def file_byte_iterator(filename):
                with open(filename, "rb") as f, \
                     mmap(f.fileno(), 0, access=ACCESS_READ) as s:
                    yield from s
        """),

        'Rick M. (numpy) + "yield from"': Algorithm("""
            def file_byte_iterator(filename):
            #    data = np.fromfile(filename, 'u1')
                yield from np.fromfile(filename, 'u1')
        """),

        'Vinay Sajip + "yield from"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    chunk = f.read(chunksize)
                    while chunk:
                        yield from chunk  # Added in Py 3.3
                        chunk = f.read(chunksize)
        """),

    })  # End Python 3.3 update.

if sys.version_info >= (3, 5):
    algorithms.update({

        'Aaron Hall + "yield from"': Algorithm("""
            from pathlib import Path

            def file_byte_iterator(path):
                ''' Given a path, return an iterator over the file
                    that lazily loads the file.
                '''
                path = Path(path)
                bufsize = get_buffer_size(path)

                with path.open('rb') as file:
                    reader = partial(file.read1, bufsize)
                    for chunk in iter(reader, bytes()):
                        yield from chunk
        """),

    })  # End Python 3.5 update.

if sys.version_info >= (3, 8, 0):
    algorithms.update({

        'Vinay Sajip + "yield from" + "walrus operator"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    while chunk := f.read(chunksize):
                        yield from chunk  # Added in Py 3.3
        """),

        'codeape + "yield from" + "walrus operator"': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    while chunk := f.read(chunksize):
                        yield from chunk
        """),

    })  # End Python 3.8.0 update.update.


#### Main ####

def main():
    global TEMP_FILENAME

    def cleanup():
        """ Clean up after testing is completed. """
        try:
            os.remove(TEMP_FILENAME)  # Delete the temporary file.
        except Exception:
            pass

    atexit.register(cleanup)

    # Create a named temporary binary file of pseudo-random bytes for testing.
    fd, TEMP_FILENAME = tempfile.mkstemp('.bin')
    with os.fdopen(fd, 'wb') as file:
         os.write(fd, bytearray(random.randrange(256) for _ in range(FILE_SIZE)))

    # Execute and time each algorithm, gather results.
    start_time = time.time()  # To determine how long testing itself takes.

    timings = []
    for label in algorithms:
        try:
            timing = TIMING(label,
                            min(timeit.repeat(algorithms[label].test,
                                              setup=COMMON_SETUP + algorithms[label].setup,
                                              repeat=TIMINGS, number=EXECUTIONS)))
        except Exception as exc:
            print('{} occurred timing the algorithm: "{}"\n  {}'.format(
                    type(exc).__name__, label, exc))
            traceback.print_exc(file=sys.stdout)  # Redirect to stdout.
            sys.exit(1)
        timings.append(timing)

    # Report results.
    print('Fastest to slowest execution speeds with {}-bit Python {}.{}.{}'.format(
            64 if sys.maxsize > 2**32 else 32, *sys.version_info[:3]))
    print('  numpy version {}'.format(np.version.full_version))
    print('  Test file size: {:,} KiB'.format(FILE_SIZE // KiB(1)))
    print('  {:,d} executions, best of {:d} repetitions'.format(EXECUTIONS, TIMINGS))
    print()

    longest = max(len(timing.label) for timing in timings)  # Len of longest identifier.
    ranked = sorted(timings, key=attrgetter('exec_time')) # Sort so fastest is first.
    fastest = ranked[0].exec_time
    for rank, timing in enumerate(ranked, 1):
        print('{:<2d} {:>{width}} : {:8.4f} secs, rel speed {:6.2f}x, {:6.2f}% slower '
              '({:6.2f} KiB/sec)'.format(
                    rank,
                    timing.label, timing.exec_time, round(timing.exec_time/fastest, 2),
                    round((timing.exec_time/fastest - 1) * 100, 2),
                    (FILE_SIZE/timing.exec_time) / KiB(1),  # per sec.
                    width=longest))
    print()
    mins, secs = divmod(time.time()-start_time, 60)
    print('Benchmark runtime (min:sec) - {:02d}:{:02d}'.format(int(mins),
                                                               int(round(secs))))

main()

回答 10

如果您正在寻找快速的东西,这是我多年来一直使用的一种方法:

from array import array

with open( path, 'rb' ) as file:
    data = array( 'B', file.read() ) # buffer the file

# evaluate it's data
for byte in data:
    v = byte # int value
    c = chr(byte)

如果要迭代char而不是ints,则可以简单地使用data = file.read(),它应该是py3中的bytes()对象。

if you are looking for something speedy, here’s a method I’ve been using that’s worked for years:

from array import array

with open( path, 'rb' ) as file:
    data = array( 'B', file.read() ) # buffer the file

# evaluate it's data
for byte in data:
    v = byte # int value
    c = chr(byte)

if you want to iterate chars instead of ints, you can simply use data = file.read(), which should be a bytes() object in py3.


如何编写内联if语句以进行打印?

问题:如何编写内联if语句以进行打印?

我仅在将布尔变量设置为时才需要打印一些内容True。因此,看完这个之后,我尝试了一个简单的示例:

>>> a = 100
>>> b = True
>>> print a if b
  File "<stdin>", line 1
    print a if b
             ^
SyntaxError: invalid syntax  

如果我写的话也是一样print a if b==True

我在这里想念什么?

I need to print some stuff only when a boolean variable is set to True. So, after looking at this, I tried with a simple example:

>>> a = 100
>>> b = True
>>> print a if b
  File "<stdin>", line 1
    print a if b
             ^
SyntaxError: invalid syntax  

Same thing if I write print a if b==True.

What am I missing here?


回答 0

Python不不会有拖尾if 声明

ifPython 有两种:

  1. if 声明:

    if condition: statement
    if condition:
        block
    
  2. if 表达式(在Python 2.5中引入)

    expression_if_true if condition else expression_if_false

请注意,两者print ab = a都是陈述。只有a一部分是表达。所以如果你写

print a if b else 0

它的意思是

print (a if b else 0)

当你写的时候

x = a if b else 0

它的意思是

x = (a if b else 0)

现在,如果没有else子句,它将打印/分配什么?打印/分配仍然存在

请注意,如果您不希望它存在,您总是可以if在一行上编写常规语句,尽管它的可读性较差,并且确实没有理由避免使用两行变体。

Python does not have a trailing if statement.

There are two kinds of if in Python:

  1. if statement:

    if condition: statement
    if condition:
        block
    
  2. if expression (introduced in Python 2.5)

    expression_if_true if condition else expression_if_false
    

And note, that both print a and b = a are statements. Only the a part is an expression. So if you write

print a if b else 0

it means

print (a if b else 0)

and similarly when you write

x = a if b else 0

it means

x = (a if b else 0)

Now what would it print/assign if there was no else clause? The print/assignment is still there.

And note, that if you don’t want it to be there, you can always write the regular if statement on a single line, though it’s less readable and there is really no reason to avoid the two-line variant.


回答 1

内联if-else EXPRESSION必须始终包含else子句,例如:

a = 1 if b else 0

如果您想保持’a’变量值不变-修改旧的’a’值(语法要求中还需要其他):

a = 1 if b else a

这段代码留下一个不变当b转弯是假的。

Inline if-else EXPRESSION must always contain else clause, e.g:

a = 1 if b else 0

If you want to leave your ‘a’ variable value unchanged – assing old ‘a’ value (else is still required by syntax demands):

a = 1 if b else a

This piece of code leaves a unchanged when b turns to be False.


回答 2

‘else’语句是强制性的。您可以执行以下操作:

>>> b = True
>>> a = 1 if b else None
>>> a
1
>>> b = False
>>> a = 1 if b else None
>>> a
>>> 

编辑:

或者,根据您的需要,您可以尝试:

>>> if b: print(a)

The ‘else’ statement is mandatory. You can do stuff like this :

>>> b = True
>>> a = 1 if b else None
>>> a
1
>>> b = False
>>> a = 1 if b else None
>>> a
>>> 

EDIT:

Or, depending of your needs, you may try:

>>> if b: print(a)

回答 3

如果您不想这样做,from __future__ import print_function可以执行以下操作:

a = 100
b = True
print a if b else "",  # Note the comma!
print "see no new line"

哪些打印:

100 see no new line

如果您不反对from __future__ import print_function或使用python 3或更高版本:

from __future__ import print_function
a = False
b = 100
print(b if a else "", end = "")

添加else是您需要进行的唯一更改,以使您的代码在语法上正确无误,您需要条件条件表达式中的else(“如果else阻塞,则行内”)

我没有使用None0与线程中的其他线程一样使用过的None/0原因是,使用会导致程序进入print Noneprint 0处于bis 的情况False

如果您想阅读有关此主题的内容,我提供了指向该功能已添加到Python的补丁程序的发行说明的链接

上面的“模式”与PEP 308中显示的模式非常相似:

这种语法可能看起来很奇怪而且倒退。为什么条件出现在表达式的中间,而不是C的c的前面?x:y?通过将新语法应用于标准库中的模块并查看结果代码的读取方式来检查该决定。在许多使用条件表达式的情况下,一个值似乎是“常见情况”,而一个值是“exceptions情况”,仅在不满足条件的极少数情况下使用。条件语法使此模式更为明显:

内容=((doc +’\ n’)如果文档为其他”)

因此,我认为总体而言,这是一种合理的处理方式,但是您不能凭借以下简单性来参数:

if logging: print data

If you don’t want to from __future__ import print_function you can do the following:

a = 100
b = True
print a if b else "",  # Note the comma!
print "see no new line"

Which prints:

100 see no new line

If you’re not aversed to from __future__ import print_function or are using python 3 or later:

from __future__ import print_function
a = False
b = 100
print(b if a else "", end = "")

Adding the else is the only change you need to make to make your code syntactically correct, you need the else for the conditional expression (the “in line if else blocks”)

The reason I didn’t use None or 0 like others in the thread have used, is because using None/0 would cause the program to print None or print 0 in the cases where b is False.

If you want to read about this topic I’ve included a link to the release notes for the patch that this feature was added to Python.

The ‘pattern’ above is very similar to the pattern shown in PEP 308:

This syntax may seem strange and backwards; why does the condition go in the middle of the expression, and not in the front as in C’s c ? x : y? The decision was checked by applying the new syntax to the modules in the standard library and seeing how the resulting code read. In many cases where a conditional expression is used, one value seems to be the ‘common case’ and one value is an ‘exceptional case’, used only on rarer occasions when the condition isn’t met. The conditional syntax makes this pattern a bit more obvious:

contents = ((doc + ‘\n’) if doc else ”)

So I think overall this is a reasonable way of approching it but you can’t argue with the simplicity of:

if logging: print data

回答 4

从2.5开始,您可以使用C的等价 三元条件运算符,其语法为:

[on_true] if [expression] else [on_false]

所以您的示例很好,但是您只需添加else,例如:

print a if b else ''

Since 2.5 you can use equivalent of C’s ”?:” ternary conditional operator and the syntax is:

[on_true] if [expression] else [on_false]

So your example is fine, but you’ve to simply add else, like:

print a if b else ''

回答 5

您可以使用:

print (1==2 and "only if condition true" or "in case condition is false")

同样,您可以继续以下操作:

print 1==2 and "aa" or ((2==3) and "bb" or "cc")

真实的例子:

>>> print "%d item%s found." % (count, (count>1 and 's' or ''))
1 item found.
>>> count = 2
>>> print "%d item%s found." % (count, (count>1 and 's' or ''))
2 items found.

You can use:

print (1==2 and "only if condition true" or "in case condition is false")

Just as well you can keep going like:

print 1==2 and "aa" or ((2==3) and "bb" or "cc")

Real world example:

>>> print "%d item%s found." % (count, (count>1 and 's' or ''))
1 item found.
>>> count = 2
>>> print "%d item%s found." % (count, (count>1 and 's' or ''))
2 items found.

回答 6

这可以通过字符串格式化来完成。它适用于%表示法以及.format()f字符串(3.6的新功能)

print '%s' % (a if b else "")

要么

print '{}'.format(a if b else "")

要么

print(f'{a if b else ""}')

This can be done with string formatting. It works with the % notation as well as .format() and f-strings (new to 3.6)

print '%s' % (a if b else "")

or

print '{}'.format(a if b else "")

or

print(f'{a if b else ""}')

回答 7

对于您的情况,这可行:

a = b or 0

编辑:这是如何工作的?

在问题中

b = True

所以评估

b or 0

结果是

True

分配给a

如果b == False?b or 0将求值到0将分配给的第二个操作数a

For your case this works:

a = b or 0

Edit: How does this work?

In the question

b = True

So evaluating

b or 0

results in

True

which is assigned to a.

If b == False?, b or 0 would evaluate to the second operand 0 which would be assigned to a.


回答 8

尝试这个 。可能对你有帮助

a=100
b=True

if b:
   print a

Try this . It might help you

a=100
b=True

if b:
   print a

回答 9

您只是太复杂了。

if b:
   print a

You’re simply overcomplicating.

if b:
   print a

回答 10

在以下情况下,您始终需要else内联:

a = 1 if b else 0

但是更简单的方法是a = int(b)

You always need an else in an inline if:

a = 1 if b else 0

But an easier way to do it would be a = int(b).


回答 11

好吧,为什么不简单地写:

if b:
    print a
else:
    print 'b is false'

Well why don’t you simply write:

if b:
    print a
else:
    print 'b is false'

回答 12

嗯,您可以通过列表理解来做到这一点。这只有在您拥有真实范围的情况下才有意义..但是它确实可以做到:

print([a for i in range(0,1) if b])

或仅使用这两个变量:

print([a for a in range(a,a+1) if b])

hmmm, you can do it with a list comprehension. This would only make sense if you had a real range.. but it does do the job:

print([a for i in range(0,1) if b])

or using just those two variables:

print([a for a in range(a,a+1) if b])

如何计算Python中ndarray中某些项目的出现?

问题:如何计算Python中ndarray中某些项目的出现?

在Python中,我有一个ndarray y 打印为array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])

我试图计算这个数组中有多少个0和多少个1

但是当我输入y.count(0)or时y.count(1),它说

numpy.ndarray 对象没有属性 count

我该怎么办?

In Python, I have an ndarray y that is printed as array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])

I’m trying to count how many 0s and how many 1s are there in this array.

But when I type y.count(0) or y.count(1), it says

numpy.ndarray object has no attribute count

What should I do?


回答 0

>>> a = numpy.array([0, 3, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 1, 3, 4])
>>> unique, counts = numpy.unique(a, return_counts=True)
>>> dict(zip(unique, counts))
{0: 7, 1: 4, 2: 1, 3: 2, 4: 1}

非numpy方式

使用collections.Counter;

>> import collections, numpy

>>> a = numpy.array([0, 3, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 1, 3, 4])
>>> collections.Counter(a)
Counter({0: 7, 1: 4, 3: 2, 2: 1, 4: 1})
>>> a = numpy.array([0, 3, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 1, 3, 4])
>>> unique, counts = numpy.unique(a, return_counts=True)
>>> dict(zip(unique, counts))
{0: 7, 1: 4, 2: 1, 3: 2, 4: 1}

Non-numpy way:

Use collections.Counter;

>> import collections, numpy

>>> a = numpy.array([0, 3, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 1, 3, 4])
>>> collections.Counter(a)
Counter({0: 7, 1: 4, 3: 2, 2: 1, 4: 1})

回答 1

那使用numpy.count_nonzero什么呢

>>> import numpy as np
>>> y = np.array([1, 2, 2, 2, 2, 0, 2, 3, 3, 3, 0, 0, 2, 2, 0])

>>> np.count_nonzero(y == 1)
1
>>> np.count_nonzero(y == 2)
7
>>> np.count_nonzero(y == 3)
3

What about using numpy.count_nonzero, something like

>>> import numpy as np
>>> y = np.array([1, 2, 2, 2, 2, 0, 2, 3, 3, 3, 0, 0, 2, 2, 0])

>>> np.count_nonzero(y == 1)
1
>>> np.count_nonzero(y == 2)
7
>>> np.count_nonzero(y == 3)
3

回答 2

就个人而言,我会去: (y == 0).sum()(y == 1).sum()

例如

import numpy as np
y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
num_zeros = (y == 0).sum()
num_ones = (y == 1).sum()

Personally, I’d go for: (y == 0).sum() and (y == 1).sum()

E.g.

import numpy as np
y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
num_zeros = (y == 0).sum()
num_ones = (y == 1).sum()

回答 3

对于您的情况,您还可以查看numpy.bincount

In [56]: a = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])

In [57]: np.bincount(a)
Out[57]: array([8, 4])  #count of zeros is at index 0 : 8
                        #count of ones is at index 1 : 4

For your case you could also look into numpy.bincount

In [56]: a = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])

In [57]: np.bincount(a)
Out[57]: array([8, 4])  #count of zeros is at index 0 : 8
                        #count of ones is at index 1 : 4

回答 4

将数组转换y为列表l,然后执行l.count(1)l.count(0)

>>> y = numpy.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
>>> l = list(y)
>>> l.count(1)
4
>>> l.count(0)
8 

Convert your array y to list l and then do l.count(1) and l.count(0)

>>> y = numpy.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
>>> l = list(y)
>>> l.count(1)
4
>>> l.count(0)
8 

回答 5

y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])

如果你知道,他们只是01

np.sum(y)

给你的数量。 np.sum(1-y)给出零。

为了稍微概括起见,如果要计数0而不是零(但可能是2或3):

np.count_nonzero(y)

给出非零的数量。

但是,如果您需要更复杂的东西,我认为numpy不会提供一个不错的count选择。在这种情况下,请转到集合:

import collections
collections.Counter(y)
> Counter({0: 8, 1: 4})

这就像一个字典

collections.Counter(y)[0]
> 8
y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])

If you know that they are just 0 and 1:

np.sum(y)

gives you the number of ones. np.sum(1-y) gives the zeroes.

For slight generality, if you want to count 0 and not zero (but possibly 2 or 3):

np.count_nonzero(y)

gives the number of nonzero.

But if you need something more complicated, I don’t think numpy will provide a nice count option. In that case, go to collections:

import collections
collections.Counter(y)
> Counter({0: 8, 1: 4})

This behaves like a dict

collections.Counter(y)[0]
> 8

回答 6

如果您确切知道要查找的号码,则可以使用以下代码;

lst = np.array([1,1,2,3,3,6,6,6,3,2,1])
(lst == 2).sum()

返回数组中发生2的次数。

If you know exactly which number you’re looking for, you can use the following;

lst = np.array([1,1,2,3,3,6,6,6,3,2,1])
(lst == 2).sum()

returns how many times 2 is occurred in your array.


回答 7

老实说,我发现将其转换为pandas系列或DataFrame最简单:

import pandas as pd
import numpy as np

df = pd.DataFrame({'data':np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])})
print df['data'].value_counts()

或罗伯特·穆伊(Robert Muil)提出的这一好话:

pd.Series([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1]).value_counts()

Honestly I find it easiest to convert to a pandas Series or DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'data':np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])})
print df['data'].value_counts()

Or this nice one-liner suggested by Robert Muil:

pd.Series([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1]).value_counts()

回答 8

没有人建议使用numpy.bincount(input, minlength)minlength = np.size(input),但它似乎是一个很好的解决方案,并且绝对是最快的

In [1]: choices = np.random.randint(0, 100, 10000)

In [2]: %timeit [ np.sum(choices == k) for k in range(min(choices), max(choices)+1) ]
100 loops, best of 3: 2.67 ms per loop

In [3]: %timeit np.unique(choices, return_counts=True)
1000 loops, best of 3: 388 µs per loop

In [4]: %timeit np.bincount(choices, minlength=np.size(choices))
100000 loops, best of 3: 16.3 µs per loop

numpy.unique(x, return_counts=True)和之间的疯狂加速numpy.bincount(x, minlength=np.max(x))

No one suggested to use numpy.bincount(input, minlength) with minlength = np.size(input), but it seems to be a good solution, and definitely the fastest:

In [1]: choices = np.random.randint(0, 100, 10000)

In [2]: %timeit [ np.sum(choices == k) for k in range(min(choices), max(choices)+1) ]
100 loops, best of 3: 2.67 ms per loop

In [3]: %timeit np.unique(choices, return_counts=True)
1000 loops, best of 3: 388 µs per loop

In [4]: %timeit np.bincount(choices, minlength=np.size(choices))
100000 loops, best of 3: 16.3 µs per loop

That’s a crazy speedup between numpy.unique(x, return_counts=True) and numpy.bincount(x, minlength=np.max(x)) !


回答 9

怎么样len(y[y==0])len(y[y==1])

What about len(y[y==0]) and len(y[y==1]) ?


回答 10

y.tolist().count(val)

使用val 0或1

由于python列表具有本机函数count,因此在使用该函数之前将其转换为list是一个简单的解决方案。

y.tolist().count(val)

with val 0 or 1

Since a python list has a native function count, converting to list before using that function is a simple solution.


回答 11

另一个简单的解决方案可能是使用numpy.count_nonzero()

import numpy as np
y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
y_nonzero_num = np.count_nonzero(y==1)
y_zero_num = np.count_nonzero(y==0)
y_nonzero_num
4
y_zero_num
8

不要让名称误导您,如果您像示例中那样将其与布尔值一起使用,就可以解决问题。

Yet another simple solution might be to use numpy.count_nonzero():

import numpy as np
y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
y_nonzero_num = np.count_nonzero(y==1)
y_zero_num = np.count_nonzero(y==0)
y_nonzero_num
4
y_zero_num
8

Don’t let the name mislead you, if you use it with the boolean just like in the example, it will do the trick.


回答 12

要计算出现次数,可以使用np.unique(array, return_counts=True)

In [75]: boo = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])

# use bool value `True` or equivalently `1`
In [77]: uniq, cnts = np.unique(boo, return_counts=1)
In [81]: uniq
Out[81]: array([0, 1])   #unique elements in input array are: 0, 1

In [82]: cnts
Out[82]: array([8, 4])   # 0 occurs 8 times, 1 occurs 4 times

To count the number of occurrences, you can use np.unique(array, return_counts=True):

In [75]: boo = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])

# use bool value `True` or equivalently `1`
In [77]: uniq, cnts = np.unique(boo, return_counts=1)
In [81]: uniq
Out[81]: array([0, 1])   #unique elements in input array are: 0, 1

In [82]: cnts
Out[82]: array([8, 4])   # 0 occurs 8 times, 1 occurs 4 times

回答 13

我会使用np.where:

how_many_0 = len(np.where(a==0.)[0])
how_many_1 = len(np.where(a==1.)[0])

I’d use np.where:

how_many_0 = len(np.where(a==0.)[0])
how_many_1 = len(np.where(a==1.)[0])

回答 14

利用系列提供的方法:

>>> import pandas as pd
>>> y = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1]
>>> pd.Series(y).value_counts()
0    8
1    4
dtype: int64

take advantage of the methods offered by a Series:

>>> import pandas as pd
>>> y = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1]
>>> pd.Series(y).value_counts()
0    8
1    4
dtype: int64

回答 15

一个简单的一般答案是:

numpy.sum(MyArray==x)   # sum of a binary list of the occurence of x (=0 or 1) in MyArray

这将导致完整的代码作为示例

import numpy
MyArray=numpy.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])  # array we want to search in
x=0   # the value I want to count (can be iterator, in a list, etc.)
numpy.sum(MyArray==0)   # sum of a binary list of the occurence of x in MyArray

现在,如果MyArray具有多个维度,并且您要计算行中值分布的出现次数(此后为pattern)

MyArray=numpy.array([[6, 1],[4, 5],[0, 7],[5, 1],[2, 5],[1, 2],[3, 2],[0, 2],[2, 5],[5, 1],[3, 0]])
x=numpy.array([5,1])   # the value I want to count (can be iterator, in a list, etc.)
temp = numpy.ascontiguousarray(MyArray).view(numpy.dtype((numpy.void, MyArray.dtype.itemsize * MyArray.shape[1])))  # convert the 2d-array into an array of analyzable patterns
xt=numpy.ascontiguousarray(x).view(numpy.dtype((numpy.void, x.dtype.itemsize * x.shape[0])))  # convert what you search into one analyzable pattern
numpy.sum(temp==xt)  # count of the searched pattern in the list of patterns

A general and simple answer would be:

numpy.sum(MyArray==x)   # sum of a binary list of the occurence of x (=0 or 1) in MyArray

which would result into this full code as exemple

import numpy
MyArray=numpy.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])  # array we want to search in
x=0   # the value I want to count (can be iterator, in a list, etc.)
numpy.sum(MyArray==0)   # sum of a binary list of the occurence of x in MyArray

Now if MyArray is in multiple dimensions and you want to count the occurence of a distribution of values in line (= pattern hereafter)

MyArray=numpy.array([[6, 1],[4, 5],[0, 7],[5, 1],[2, 5],[1, 2],[3, 2],[0, 2],[2, 5],[5, 1],[3, 0]])
x=numpy.array([5,1])   # the value I want to count (can be iterator, in a list, etc.)
temp = numpy.ascontiguousarray(MyArray).view(numpy.dtype((numpy.void, MyArray.dtype.itemsize * MyArray.shape[1])))  # convert the 2d-array into an array of analyzable patterns
xt=numpy.ascontiguousarray(x).view(numpy.dtype((numpy.void, x.dtype.itemsize * x.shape[0])))  # convert what you search into one analyzable pattern
numpy.sum(temp==xt)  # count of the searched pattern in the list of patterns

回答 16

您可以使用字典理解来创建整齐的单线。可以在这里找到有关字典理解的更多信息

>>>counts = {int(value): list(y).count(value) for value in set(y)}
>>>print(counts)
{0: 8, 1: 4}

这将创建一个字典,将ndarray中的值作为键,并将值的计数分别作为键的值。

每当您要计算此格式数组中某个值的出现次数时,此方法都将起作用。

You can use dictionary comprehension to create a neat one-liner. More about dictionary comprehension can be found here

>>>counts = {int(value): list(y).count(value) for value in set(y)}
>>>print(counts)
{0: 8, 1: 4}

This will create a dictionary with the values in your ndarray as keys, and the counts of the values as the values for the keys respectively.

This will work whenever you want to count occurences of a value in arrays of this format.


回答 17

尝试这个:

a = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
list(a).count(1)

Try this:

a = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
list(a).count(1)

回答 18

这可以通过以下方法轻松完成

y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
y.tolist().count(1)

This can be done easily in the following method

y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
y.tolist().count(1)

回答 19

由于您的ndarray仅包含0和1,因此您可以使用sum()获得1的出现,并使用len()-sum()获得0的出现。

num_of_ones = sum(array)
num_of_zeros = len(array)-sum(array)

Since your ndarray contains only 0 and 1, you can use sum() to get the occurrence of 1s and len()-sum() to get the occurrence of 0s.

num_of_ones = sum(array)
num_of_zeros = len(array)-sum(array)

回答 20

您有一个只有1和0的特殊数组。所以一个诀窍是使用

np.mean(x)

这将为您提供数组中1s的百分比。或者,使用

np.sum(x)
np.sum(1-x)

将为您提供数组中1和0的绝对数。

You have a special array with only 1 and 0 here. So a trick is to use

np.mean(x)

which gives you the percentage of 1s in your array. Alternatively, use

np.sum(x)
np.sum(1-x)

will give you the absolute number of 1 and 0 in your array.


回答 21

dict(zip(*numpy.unique(y, return_counts=True)))

刚刚在此处复制了Seppo Enarvi的评论,这应该是一个正确的答案

dict(zip(*numpy.unique(y, return_counts=True)))

Just copied Seppo Enarvi’s comment here which deserves to be a proper answer


回答 22

它涉及更多的步骤,但是对2d数组和更复杂的过滤器也适用的更灵活的解决方案是创建一个布尔掩码,然后在掩码上使用.sum()。

>>>>y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
>>>>mask = y == 0
>>>>mask.sum()
8

It involves one more step, but a more flexible solution which would also work for 2d arrays and more complicated filters is to create a boolean mask and then use .sum() on the mask.

>>>>y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
>>>>mask = y == 0
>>>>mask.sum()
8

回答 23

如果您不想使用numpy或collections模块,则可以使用字典:

d = dict()
a = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1]
for item in a:
    try:
        d[item]+=1
    except KeyError:
        d[item]=1

结果:

>>>d
{0: 8, 1: 4}

当然,您也可以使用if / else语句。我认为Counter函数的功能几乎相同,但这更加透明。

If you don’t want to use numpy or a collections module you can use a dictionary:

d = dict()
a = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1]
for item in a:
    try:
        d[item]+=1
    except KeyError:
        d[item]=1

result:

>>>d
{0: 8, 1: 4}

Of course you can also use an if/else statement. I think the Counter function does almost the same thing but this is more transparant.


回答 24

对于通用条目:

x = np.array([11, 2, 3, 5, 3, 2, 16, 10, 10, 3, 11, 4, 5, 16, 3, 11, 4])
n = {i:len([j for j in np.where(x==i)[0]]) for i in set(x)}
ix = {i:[j for j in np.where(x==i)[0]] for i in set(x)}

将输出一个计数:

{2: 2, 3: 4, 4: 2, 5: 2, 10: 2, 11: 3, 16: 2}

和索引:

{2: [1, 5],
3: [2, 4, 9, 14],
4: [11, 16],
5: [3, 12],
10: [7, 8],
11: [0, 10, 15],
16: [6, 13]}

For generic entries:

x = np.array([11, 2, 3, 5, 3, 2, 16, 10, 10, 3, 11, 4, 5, 16, 3, 11, 4])
n = {i:len([j for j in np.where(x==i)[0]]) for i in set(x)}
ix = {i:[j for j in np.where(x==i)[0]] for i in set(x)}

Will output a count:

{2: 2, 3: 4, 4: 2, 5: 2, 10: 2, 11: 3, 16: 2}

And indices:

{2: [1, 5],
3: [2, 4, 9, 14],
4: [11, 16],
5: [3, 12],
10: [7, 8],
11: [0, 10, 15],
16: [6, 13]}

回答 25

这里有一些东西,通过它您可以计算出特定数字的出现次数:根据您的代码

count_of_zero = list(y [y == 0])。count(0)

打印(count_of_zero)

//根据匹配项,将有布尔值,根据True值,将返回数字0

here I have something, through which you can count the number of occurrence of a particular number: according to your code

count_of_zero=list(y[y==0]).count(0)

print(count_of_zero)

// according to the match there will be boolean values and according to True value the number 0 will be return


回答 26

如果您对最快的执行感兴趣,那么您会事先知道要查找的值,并且您的数组是一维的,否则您对展平数组上的结果感兴趣(在这种情况下,函数的输入应是np.flatten(arr)不是只arr),然后Numba是你的朋友:

import numba as nb


@nb.jit
def count_nb(arr, value):
    result = 0
    for x in arr:
        if x == value:
            result += 1
    return result

或者,对于超大型阵列,并行化可能会有所帮助:

@nb.jit(parallel=True)
def count_nbp(arr, value):
    result = 0
    for i in nb.prange(arr.size):
        if arr[i] == value:
            result += 1
    return result

对这些基准进行基准测试np.count_nonzero()(也存在创建可以避免的临时数组的问题)和np.unique()基于-的解决方案

import numpy as np


def count_np(arr, value):
    return np.count_nonzero(arr == value)
import numpy as np


def count_np2(arr, value):
    uniques, counts = np.unique(a, return_counts=True)
    counter = dict(zip(uniques, counts))
    return counter[value] if value in counter else 0 

用于使用以下命令生成的输入:

def gen_input(n, a=0, b=100):
    return np.random.randint(a, b, n)

获得以下图(图的第二行是对更快方法的放大):

bm_full bm_zoom

表明基于Numba的解决方案比NumPy的解决方案明显更快,并且对于非常大的输入,并行方法比朴素的方法要快。


完整的代码在这里

If you are interested in the fastest execution, you know in advance which value(s) to look for, and your array is 1D, or you are otherwise interested in the result on the flattened array (in which case the input of the function should be np.flatten(arr) rather than just arr), then Numba is your friend:

import numba as nb


@nb.jit
def count_nb(arr, value):
    result = 0
    for x in arr:
        if x == value:
            result += 1
    return result

or, for very large arrays where parallelization may be beneficial:

@nb.jit(parallel=True)
def count_nbp(arr, value):
    result = 0
    for i in nb.prange(arr.size):
        if arr[i] == value:
            result += 1
    return result

Benchmarking these against np.count_nonzero() (which also has a problem of creating a temporary array which may be avoided) and np.unique()-based solution

import numpy as np


def count_np(arr, value):
    return np.count_nonzero(arr == value)
import numpy as np


def count_np2(arr, value):
    uniques, counts = np.unique(a, return_counts=True)
    counter = dict(zip(uniques, counts))
    return counter[value] if value in counter else 0 

for input generated with:

def gen_input(n, a=0, b=100):
    return np.random.randint(a, b, n)

the following plots are obtained (the second row of plots is a zoom on the faster approach):

bm_full bm_zoom

Showing that Numba-based solution are noticeably faster than the NumPy counterparts, and, for very large inputs, the parallel approach is faster than the naive one.


Full code available here.


回答 27

如果使用生成器处理非常大的数组,则可以选择。令人高兴的是,这种方法对数组和列表都适用,并且您不需要任何其他程序包。此外,您没有使用太多的内存。

my_array = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
sum(1 for val in my_array if val==0)
Out: 8

if you are dealing with very large arrays using generators could be an option. The nice thing here it that this approach works fine for both arrays and lists and you dont need any additional package. Additionally, you are not using that much memory.

my_array = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
sum(1 for val in my_array if val==0)
Out: 8

回答 28

Numpy为此提供了一个模块。只是一个小技巧。将您的输入数组作为垃圾箱。

numpy.histogram(y, bins=y)

输出是2个数组。一个带有值本身,另一个带有相应的频率。

Numpy has a module for this. Just a small hack. Put your input array as bins.

numpy.histogram(y, bins=y)

The output are 2 arrays. One with the values itself, other with the corresponding frequencies.


回答 29

using numpy.count

$ a = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1]

$ np.count(a, 1)
using numpy.count

$ a = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1]

$ np.count(a, 1)