However, I’m wondering if there’s a simpler way of achieving the same results. I get the impression that using os.walk only to return the top level is inefficient/too much.
Answer 0
Filter the result using os.path.isdir() (and use os.path.join() to get the real path):
>>> [name for name in os.listdir(thedir) if os.path.isdir(os.path.join(thedir, name))]
['ctypes', 'distutils', 'encodings', 'lib-tk', 'config', 'idlelib', 'xml', 'bsddb', 'hotshot', 'logging', 'doc', 'test', 'compiler', 'curses', 'site-packages', 'email', 'sqlite3', 'lib-dynload', 'wsgiref', 'plat-linux2', 'plat-mac']
os.walk is a generator and calling next will get the first result in the form of a 3-tuple (dirpath, dirnames, filenames). Thus the [1] index returns only the dirnames from that tuple.
Just to add that using os.listdir() does not “take a lot of processing vs very simple os.walk().next()[1]”. This is because os.walk() uses os.listdir() internally. In fact if you test them together:
>>> import timeit
>>> timeit.timeit("os.walk('.').next()[1]", "import os", number=10000)
1.1215229034423828
>>> timeit.timeit("[ name for name in os.listdir('.') if os.path.isdir(os.path.join('.', name)) ]", "import os", number=10000)
1.0592019557952881
The filtering of os.listdir() is very slightly faster.
Answer 7
A much simpler and more elegant way is this:
import os
dir_list = next(os.walk('.'))[1]  # next() works on both Python 2.6+ and Python 3
print(dir_list)
Run this script in the folder whose subfolder names you want. It gives you only the names of the immediate subfolders (without their full paths).
Python 3.4 introduced the pathlib module into the standard library, which provides an object oriented approach to handle filesystem paths:
from pathlib import Path
p = Path('./')
[f for f in p.iterdir() if f.is_dir()]
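As a self-contained sketch of the same pathlib idea, the helper below returns only the names of the first-level directories, which is comparable to what next(os.walk(path))[1] gives. The function name immediate_subdirs is made up for illustration.

```python
from pathlib import Path


def immediate_subdirs(root):
    """Return the sorted names of the directories directly inside `root`."""
    # iterdir() yields only direct children, so nothing deeper is visited.
    return sorted(entry.name for entry in Path(root).iterdir() if entry.is_dir())
```

Calling immediate_subdirs('.') lists the subfolders of the current working directory; files and nested folders are left out.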
Answer 16
-- This will exclude files and traverse through 1 level of sub folders in the root
import os

def list_files(dir):
    List = []
    filterstr = ' '
    for root, dirs, files in os.walk(dir, topdown = True):
        #r.append(root)
        if (root == dir):
            pass
        elif filterstr in root:
            #filterstr = ' '
            pass
        else:
            filterstr = root
            #print(root)
            for name in files:
                print(root)
                print(dirs)
                List.append(os.path.join(root,name))
                #print(os.path.join(root,name),"\n")
    print(List,"\n")
    return List
Suppose my Python code is executed in a Windows directory that is not known in advance, say 'main', and wherever the code is installed, it needs to access the file 'main/2091/data.txt' when it runs.
How should I use the open(location) function? What should location be?
Edit:
I found that the simple code below works. Does it have any disadvantages?
import os
script_dir = os.path.dirname(__file__) #<-- absolute dir the script is in
rel_path = "2091/data.txt"
abs_file_path = os.path.join(script_dir, rel_path)
With this type of thing you need to be careful what your actual working directory is. For example, you may not run the script from the directory the file is in. In this case, you can’t just use a relative path by itself.
If you are sure the file you want is in a subdirectory beneath where the script is actually located, you can use __file__ to help you out here. __file__ is the full path to where the script you are running is located.
So you can fiddle with something like this:
import os
script_dir = os.path.dirname(__file__) #<-- absolute dir the script is in
rel_path = "2091/data.txt"
abs_file_path = os.path.join(script_dir, rel_path)
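The same idea can be sketched with pathlib. This is only an illustration: the helper name path_next_to is hypothetical, and resolve() is used so the result is absolute even when the script path passed in is relative.

```python
from pathlib import Path


def path_next_to(script_file, rel_path):
    """Build an absolute path to `rel_path`, relative to the folder holding `script_file`."""
    script_dir = Path(script_file).resolve().parent
    return script_dir / rel_path


# Inside a script you would typically pass __file__:
# data_path = path_next_to(__file__, "2091/data.txt")
```

Because the base is the script's own folder rather than the current working directory, the result does not depend on where the interpreter was started.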
Answer 1
This code works fine:
import os

def readFile(filename):
    # Open the file, print its contents, and close it again
    with open(filename) as filehandle:
        print(filehandle.read())

# Directory containing this script (note: __file__ must not be quoted)
fileDir = os.path.dirname(os.path.realpath(__file__))
print(fileDir)
# For accessing the file in the same folder
filename = "same.txt"
readFile(filename)
# For accessing the file in a folder contained in the current folder
filename = os.path.join(fileDir, 'Folder1.1/same.txt')
readFile(filename)
# For accessing the file in the parent folder of the current folder
filename = os.path.join(fileDir, '../same.txt')
readFile(filename)
# For accessing the file inside a sibling folder.
filename = os.path.join(fileDir, '../Folder2/same.txt')
filename = os.path.abspath(os.path.realpath(filename))
print(filename)
readFile(filename)
This is a great answer because it dynamically creates an absolute system path to the desired file.
Cory Mawhorter noticed that __file__ is a relative path (it is as well on my system) and suggested using os.path.abspath(__file__). os.path.abspath, however, returns the absolute path of your current script (i.e. /path/to/dir/foobar.py)
To use this method (and how I eventually got it working) you have to remove the script name from the end of the path:
import os
script_path = os.path.abspath(__file__) # i.e. /path/to/dir/foobar.py
script_dir = os.path.split(script_path)[0] #i.e. /path/to/dir/
rel_path = "2091/data.txt"
abs_file_path = os.path.join(script_dir, rel_path)
The resulting abs_file_path (in this example) becomes: /path/to/dir/2091/data.txt
Answer 3
It depends on what operating system you’re using. If you want a solution that is compatible with both Windows and *nix something like:
from os import path
file_path = path.relpath("2091/data.txt")
with open(file_path) as f:
    <do stuff>
should work fine.
The path module is able to format a path for whatever operating system it’s running on. Also, python handles relative paths just fine, so long as you have correct permissions.
Edit:
As mentioned by kindall in the comments, Python can convert between Unix-style and Windows-style paths anyway, so even simpler code will work:
with open("2091/data.txt") as f:
    <do stuff>
That being said, the path module still has some useful functions.
I spent a lot of time working out why my code could not find my file when running Python 3 on Windows. Adding a . before the / made everything work:
Python just passes the filename you give it to the operating system, which opens it. If your operating system supports relative paths like main/2091/data.txt (hint: it does), then that will work fine.
You may find that the easiest way to answer a question like this is to try it and see what happens.
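A small experiment along those lines, as a sketch only: the temporary directory and file contents are made up, and the point it shows is that a relative path passed to open() is resolved against the current working directory, not against the script's location.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Create <workdir>/2091/data.txt so there is something to open.
    os.makedirs(os.path.join(workdir, "2091"))
    with open(os.path.join(workdir, "2091", "data.txt"), "w") as f:
        f.write("hello")

    previous_cwd = os.getcwd()
    os.chdir(workdir)  # the relative path below is resolved from here
    try:
        with open("2091/data.txt") as f:
            content = f.read()
    finally:
        os.chdir(previous_cwd)  # restore the working directory before cleanup

print(content)  # hello
```

If the working directory were anything else, the same open("2091/data.txt") call would raise FileNotFoundError, which is why the answers above build paths from __file__.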
Answer 10
import os

def file_path(relative_path):
    dir = os.path.dirname(os.path.abspath(__file__))
    split_path = relative_path.split("/")
    new_path = os.path.join(dir, *split_path)
    return new_path

with open(file_path("2091/data.txt"), "w") as f:
    f.write("Powerful you have become.")
and this would raise a syntax error. I used to get confused a lot. Then, after some searching on Google, I found out why the error occurred. I am writing this for beginners.
It's because, for the path to be read as Unicode, you simply add a \ at the start of the file path.
numpy has three different functions which seem like they can be used for the same things — except that numpy.maximum can only be used element-wise, while numpy.max and numpy.amax can be used on particular axes, or all elements. Why is there more than just numpy.max? Is there some subtlety to this in performance?
np.max is just an alias for np.amax. This function only works on a single input array and finds the value of maximum element in that entire array (returning a scalar). Alternatively, it takes an axis argument and will find the maximum value along an axis of the input array (returning a new array).
>>> a = np.array([[0, 1, 6],
...               [2, 4, 1]])
>>> np.max(a)
6
>>> np.max(a, axis=0) # max of each column
array([2, 4, 6])
The default behaviour of np.maximum is to take two compatible arrays and compute their element-wise maximum. Here, 'compatible' means that one array can be broadcast to the other. For example:
>>> b = np.array([3, 6, 1])
>>> c = np.array([4, 2, 9])
>>> np.maximum(b, c)
array([4, 6, 9])
But np.maximum is also a universal function which means that it has other features and methods which come in useful when working with multidimensional arrays. For example you can compute the cumulative maximum over an array (or a particular axis of the array):
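A small sketch of the cumulative maximum, using the accumulate method that every NumPy ufunc provides. The array values are made up for illustration.

```python
import numpy as np

d = np.array([2, 0, 3, -4, -2, 7, 9])

# Running maximum of a 1-D array: each entry is the largest value seen so far.
running_max = np.maximum.accumulate(d)
print(running_max)  # [2 2 3 3 3 7 9]

# For a 2-D array, `axis` selects the direction of accumulation.
a = np.array([[0, 1, 6],
              [2, 4, 1]])
row_running_max = np.maximum.accumulate(a, axis=1)
print(row_running_max)
```

np.max and np.amax have no such method because they are ordinary functions rather than ufuncs.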
You’ve already stated why np.maximum is different – it returns an array that is the element-wise maximum between two arrays.
As for np.amax and np.max: they both call the same function – np.max is just an alias for np.amax, and they compute the maximum of all elements in an array, or along an axis of an array.
In [1]: import numpy as np
In [2]: np.amax
Out[2]: <function numpy.core.fromnumeric.amax>
In [3]: np.max
Out[3]: <function numpy.core.fromnumeric.amax>
In an IPython notebook, I have a while loop that listens to a serial port and prints the received data in real time.
What I want is to show only the latest received data (i.e. a single line showing the most recent data, with no scrolling in the cell output area).
What I need (I think) is to clear the old cell output when I receive new data, and then print the new data. How can I clear the old output programmatically?
from IPython.display import clear_output
for i in range(10):
    clear_output(wait=True)
    print("Hello World!")
At the end of this loop you will only see one Hello World!.
Without a code example it’s not easy to give you working code. Probably buffering the latest n events is a good strategy. Whenever the buffer changes you can clear the cell’s output and print the buffer again.
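The buffering strategy can be sketched with the standard library alone. This is an assumption-laden illustration: make_line_buffer, push and render are hypothetical names, and the sample readings stand in for data from the serial port.

```python
from collections import deque


def make_line_buffer(max_lines):
    """Return a (push, render) pair that remembers only the newest `max_lines` entries."""
    buffer = deque(maxlen=max_lines)  # older entries are discarded automatically

    def push(line):
        buffer.append(line)

    def render():
        return "\n".join(buffer)

    return push, render


push, render = make_line_buffer(3)
for reading in ["t=1", "t=2", "t=3", "t=4"]:
    push(reading)
    # In a notebook, call clear_output(wait=True) from IPython.display here,
    # then print(render()) to redraw the buffer.
print(render())
```

With max_lines set to 1 this reduces to showing only the most recent value.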
And in case you come here, like I did, looking to do the same thing for plots in a Julia notebook in Jupyter, using Plots, you can use:
IJulia.clear_output(true)
so for a kind of animated plot of multiple runs
if nrun==1
    display(plot(x,y))          # first plot
else
    IJulia.clear_output(true)   # clear the window (as above)
    display(plot!(x,y))         # plot! overlays the plot
end
Without the clear_output call, all plots appear separately.
You can use the IPython.display.clear_output to clear the output as mentioned in cel’s answer. I would add that for me the best solution was to use this combination of parameters to print without any “shakiness” of the notebook:
from IPython.display import clear_output
for i in range(10):
    clear_output(wait=True)
    print(i, flush=True)
I have a machine learning classification problem with 80% categorical variables. Must I use one hot encoding if I want to use some classifier for the classification? Can i pass the data to a classifier without the encoding?
I am trying to do the following for feature selection:
I change the type of the categorical features to ‘category’:
non_categorial_features = ['orig_destination_distance',
                           'srch_adults_cnt',
                           'srch_children_cnt',
                           'srch_rm_cnt',
                           'cnt']
for categorical_feature in list(train_small.columns):
    if categorical_feature not in non_categorial_features:
        train_small[categorical_feature] = train_small[categorical_feature].astype('category')
The problem is that the third part often gets stuck, although I am using a powerful machine.
Thus, without the one hot encoding I can’t do any feature selection, for determining the importance of the features.
What do you recommend?
Answer 0
Approach 1: You can use get_dummies on a pandas dataframe.
Example 1:
import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
Out[]:
a b c
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
Example 2:
The following will transform a given column into one hot. Use prefix to have multiple dummies.
import pandas as pd
df = pd.DataFrame({
    'A':['a','b','a'],
    'B':['b','a','c']
})
df
Out[]:
A B
0 a b
1 b a
2 a c
# Get one hot encoding of columns B
one_hot = pd.get_dummies(df['B'])
# Drop column B as it is now encoded
df = df.drop('B',axis = 1)
# Join the encoded df
df = df.join(one_hot)
df
Out[]:
A a b c
0 a 0 1 0
1 b 1 0 0
2 a 0 0 1
Approach 2: Use Scikit-learn
Using a OneHotEncoder has the advantage of being able to fit on some training data and then transform on some other data using the same instance. We also have handle_unknown to further control what the encoder does with unseen data.
Given a dataset with three features and four samples, we let the encoder find the maximum value per feature and transform the data to a binary one-hot encoding.
Much easier to use Pandas for basic one-hot encoding. If you’re looking for more options you can use scikit-learn.
For basic one-hot encoding with Pandas you pass your data frame into the get_dummies function.
For example, if I have a dataframe called imdb_movies:
…and I want to one-hot encode the Rated column, I do this:
pd.get_dummies(imdb_movies.Rated)
This returns a new dataframe with a column for every “level” of rating that exists, along with either a 1 or 0 specifying the presence of that rating for a given observation.
Usually, we want this to be part of the original dataframe. In this case, we attach our new dummy-coded frame onto the original frame using "column-binding".
We can column-bind by using Pandas concat function:
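A minimal sketch of that column-binding step. The imdb_movies frame built here is a hypothetical stand-in with made-up rows, since the real data is not shown.

```python
import pandas as pd

# Hypothetical stand-in for the imdb_movies frame (the real data is not shown).
imdb_movies = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Rated": ["PG", "R", "PG"],
})

rated_dummies = pd.get_dummies(imdb_movies.Rated)

# axis=1 places the dummy columns side by side with the original columns.
imdb_movies = pd.concat([imdb_movies, rated_dummies], axis=1)
print(imdb_movies.columns.tolist())  # ['Title', 'Rated', 'PG', 'R']
```

The original Rated column is kept; drop it afterwards if only the encoded columns are needed.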
You can do it with numpy.eye and the array element selection mechanism:
import numpy as np
nb_classes = 6
data = [[2, 3, 4, 0]]

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]
The return value of indices_to_one_hot(data, nb_classes) is now:
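The original output listing is not present here. Re-running the function on the sample data gives the following; this is a reproduction for illustration, not the author's own listing.

```python
import numpy as np


def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(nb_classes)[targets]


one_hot = indices_to_one_hot([[2, 3, 4, 0]], 6)
print(one_hot)
# [[0. 0. 1. 0. 0. 0.]
#  [0. 0. 0. 1. 0. 0.]
#  [0. 0. 0. 0. 1. 0.]
#  [1. 0. 0. 0. 0. 0.]]
```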
Lastly, is it necessary for you to one-hot encode? One-hot encoding greatly increases the number of features (one new column per level), drastically increasing the run time of any classifier or anything else you are going to run, especially when each categorical feature has many levels. Instead you can do dummy coding.
Using dummy encoding usually works well, for much less run time and complexity. A wise prof once told me, ‘Less is More’.
Here’s the code for my custom encoding function if you want.
from sklearn.preprocessing import LabelEncoder

#Auto encodes any dataframe column of type category or object.
def dummyEncode(df):
    columnsToEncode = list(df.select_dtypes(include=['category','object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            df[feature] = le.fit_transform(df[feature])
        except Exception:
            print('Error encoding '+feature)
    return df
EDIT: Comparison to be clearer:
One-hot encoding: convert n levels to n-1 columns.
Index  Animal         Index  cat  mouse
  1     dog             1     0     0
  2     cat      -->    2     1     0
  3    mouse            3     0     1
You can see how this will explode your memory if you have many different types (or levels) in your categorical feature. Keep in mind, this is just ONE column.
Dummy Coding:
Index  Animal         Index  Animal
  1     dog             1      0
  2     cat      -->    2      1
  3    mouse            3      2
Convert to numerical representations instead. Greatly saves feature space, at the cost of a bit of accuracy.
Answer 4
One-hot encoding with pandas is very easy:
import pandas as pd

def one_hot(df, cols):
    """
    @param df pandas DataFrame
    @param cols a list of columns to encode
    @return a DataFrame with one-hot encoding
    """
    for each in cols:
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df
EDIT:
Another way to one-hot encode, using sklearn's LabelBinarizer:
from sklearn.preprocessing import LabelBinarizer
label_binarizer = LabelBinarizer()
label_binarizer.fit(all_your_labels_list) # need to be global or remembered to use it later

def one_hot_encode(x):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
    """
    return label_binarizer.transform(x)
Answer 5
You can use the numpy.eye function.
import numpy as np

def one_hot_encode(x, n_classes):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
    """
    return np.eye(n_classes)[x]

def main():
    labels = [0,1,2,3,4,3,2,1,0]
    n_classes = 5
    one_hot_list = one_hot_encode(labels, n_classes)
    print(one_hot_list)

if __name__ == "__main__":
    main()
One-hot encoding requires a bit more than converting the values to indicator variables. A typical ML process requires you to apply this coding several times, to validation or test data sets, and to apply the model you construct to real-time observed data. You should store the mapping (transform) that was used to construct the model. A good solution would use the DictVectorizer or LabelEncoder (followed by get_dummies). Here is a function that you can use:
This works on a pandas dataframe and for each column of the dataframe it creates and returns a mapping back. So you would call it like this:
train_data, le_dict = oneHotEncode2(train_data)
Then on the test data, the call is made by passing the dictionary returned back from training:
test_data, _ = oneHotEncode2(test_data, le_dict)
An equivalent method is to use DictVectorizer. A related post on this is on my blog; I mention it here because it gives some reasoning for this approach over simply using get_dummies (disclosure: this is my own blog).
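The oneHotEncode2 function itself is not reproduced in this text. As a dependency-free sketch of the underlying idea only (learn the mapping on training data, store it, and reuse it on later data), something like the following could work. The names fit_one_hot and transform_one_hot are hypothetical, and the sketch operates on plain lists rather than a dataframe.

```python
def fit_one_hot(values):
    """Learn a category -> column index mapping from training values."""
    return {category: i for i, category in enumerate(sorted(set(values)))}


def transform_one_hot(values, mapping):
    """Encode values with a stored mapping; unseen categories become all zeros."""
    rows = []
    for value in values:
        row = [0] * len(mapping)
        if value in mapping:
            row[mapping[value]] = 1
        rows.append(row)
    return rows


mapping = fit_one_hot(["red", "green", "red"])  # learned once, on training data
train_rows = transform_one_hot(["red", "green"], mapping)
test_rows = transform_one_hot(["green", "blue"], mapping)  # "blue" was never seen
print(train_rows)  # [[0, 1], [1, 0]]
print(test_rows)   # [[1, 0], [0, 0]]
```

Encoding unseen categories as all zeros is one possible design choice; raising an error instead would surface data drift earlier.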
You can pass the data to catboost classifier without encoding. Catboost handles categorical variables itself by performing one-hot and target expanding mean encoding.
Answer 10
You can also do the following. Note that with the approach below you do not have to use pd.concat.
import pandas as pd
# initialise data of lists.
data = {'Color': ['Red', 'Yellow', 'Red', 'Yellow'], 'Length': [20.1, 21.1, 19.1, 18.1], 'Group': [1, 2, 1, 2]}
# Create DataFrame
df = pd.DataFrame(data)
for _c in df.select_dtypes(include=['object']).columns:
    print(_c)
    df[_c] = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed
You can also change explicit columns to categorical. For example, here I am changing Color and Group:
import pandas as pd
# initialise data of lists.
data = {'Color': ['Red', 'Yellow', 'Red', 'Yellow'], 'Length': [20.1, 21.1, 19.1, 18.1], 'Group': [1, 2, 1, 2]}
# Create DataFrame
df = pd.DataFrame(data)
columns_to_change = list(df.select_dtypes(include=['object']).columns)
columns_to_change.append('Group')
for _c in columns_to_change:
    print(_c)
    df[_c] = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed
class OneHotEncoder:
    def __init__(self,optionKeys):
        length=len(optionKeys)
        self.__dict__={optionKeys[j]:[0 if i!=j else 1 for i in range(length)] for j in range(length)}
import numpy as np

def one_hot_encode(y):
    """Convert an iterable of indices to one-hot encoded labels."""
    y = y.flatten() # Sometimes not flattened vector is passed e.g (118,1) in these cases
    # the function ends up creating a tensor e.g. (118, 2, 1). flatten removes this issue
    nb_classes = len(np.unique(y)) # get the number of unique classes
    standardised_labels = dict(zip(np.unique(y), np.arange(nb_classes))) # get the class labels as a dictionary
    # which then is standardised. E.g imagine class labels are (4,7,9) if a vector of y containing 4,7 and 9 is
    # directly passed then np.eye(nb_classes)[4] or 7,9 throws an out of index error.
    # standardised labels fixes this issue by returning a dictionary;
    # standardised_labels = {4:0, 7:1, 9:2}. The values of the dictionary are mapped to keys in y array.
    # standardised_labels also removes the error that is raised if the labels are floats. E.g. 1.0; element
    # cannot be called by an integer index e.g y[1.0] - throws an index error.
    targets = np.vectorize(standardised_labels.get)(y) # map the dictionary values to array.
    return np.eye(nb_classes)[targets]
Here is a function to do one-hot-encoding without using numpy, pandas, or other packages. It takes a list of integers, booleans, or strings (and perhaps other types too).
import typing

def one_hot_encode(items: list) -> typing.List[list]:
    results = []
    # find the unique items (we want unique items b/c duplicate items will have the same encoding)
    unique_items = list(set(items))
    # sort the unique items
    sorted_items = sorted(unique_items)
    # find how long the list of each item should be
    max_index = len(unique_items)
    for item in items:
        # create a list of zeros the appropriate length
        one_hot_encoded_result = [0 for i in range(0, max_index)]
        # find the index of the item
        one_hot_index = sorted_items.index(item)
        # change the zero at the index from the previous line to a one
        one_hot_encoded_result[one_hot_index] = 1
        # add the result
        results.append(one_hot_encoded_result)
    return results
I know there are already a lot of answers to this question, but I noticed two things. First, most of the answers use packages like numpy and/or pandas. And this is a good thing. If you are writing production code, you should probably be using robust, fast algorithms like those provided in the numpy/pandas packages. But, for the sake of education, I think someone should provide an answer which has a transparent algorithm and not just an implementation of someone else’s algorithm. Second, I noticed that many of the answers do not provide a robust implementation of one-hot encoding because they do not meet one of the requirements below. Below are some of the requirements (as I see them) for a useful, accurate, and robust one-hot encoding function:
A one-hot encoding function must:
handle list of various types (e.g. integers, strings, floats, etc.) as input
handle an input list with duplicates
return a list of lists corresponding (in the same order as) to the inputs
return a list of lists where each list is as short as possible
I tested many of the answers to this question and most of them fail on one of the requirements above.
Answer 18
Try this:
!pip install category_encoders
import category_encoders as ce
categorical_columns = [...the list of names of the columns you want to one-hot-encode ...]
encoder = ce.OneHotEncoder(cols=categorical_columns, use_cat_names=True)
df_train_encoded = encoder.fit_transform(df_train_small)
df_train_encoded.head()
The resulting dataframe df_train_encoded is the same as the original, but the categorical features are now replaced with their one-hot-encoded versions.
I have a simple problem, but I cannot find a good solution to it.
I want to take a NumPy 2D array which represents a grayscale image, and convert it to an RGB PIL image while applying some of the matplotlib colormaps.
I can get a reasonable PNG output by using the pyplot.figure.figimage command:
Although I could adapt this to get what I want (probably using StringIO to get the PIL image), I wonder if there is a simpler way to do that, since it seems to be a very natural problem of image visualization. Let's say, something like this:
colored_PIL_image = magic_function(array, cmap)
Answer 0
Quite a busy one-liner, but here it is:
First ensure your NumPy array, myarray, is normalised with the max value at 1.0.
Apply the colormap directly to myarray.
Rescale to the 0-255 range.
Convert to integers, using np.uint8().
Use Image.fromarray().
And you’re done:
import numpy as np
from PIL import Image
from matplotlib import cm
im = Image.fromarray(np.uint8(cm.gist_earth(myarray)*255))
[Image: the result saved with plt.savefig()]
[Image: the result saved with im.save()]
Answer 1
input = numpy_image
np.uint8 -> converts to integers
convert('RGB') -> converts to RGB
Image.fromarray -> returns an image object
from PIL import Image
import numpy as np
PIL_image = Image.fromarray(np.uint8(numpy_image)).convert('RGB')
PIL_image = Image.fromarray(numpy_image.astype('uint8'), 'RGB')
The method described in the accepted answer didn’t work for me even after applying changes mentioned in its comments. But the below simple code worked:
import matplotlib.pyplot as plt
plt.imsave(filename, np_array, cmap='Greys')
np_array can be either a 2D array with float values in 0..1 or uint8 values in 0..255, and in that case it needs cmap. For 3D arrays, cmap will be ignored.
I want to select all values from the ‘First Season’ column and replace those that are over 1990 by 1. In this example, only Baltimore Ravens would have the 1996 replaced by 1 (keeping the rest of the data intact).
I have used the following:
df.loc[(df['First Season'] > 1990)] = 1
But, it replaces all the values in that row by 1, and not just the values in the ‘First Season’ column.
How can I replace just the values from that column?
In [41]:
df.loc[df['First Season'] > 1990, 'First Season'] = 1
df
Out[41]:
                 Team  First Season  Total Games
0      Dallas Cowboys          1960          894
1       Chicago Bears          1920         1357
2   Green Bay Packers          1921         1339
3      Miami Dolphins          1966          792
4    Baltimore Ravens             1          326
5  San Franciso 49ers          1950         1003
So the syntax here is:
df.loc[<mask>(here mask is generating the labels to index) , <optional column(s)> ]
If you want to generate a boolean indicator, you can use the condition to produce a boolean Series and cast its dtype to int; this converts True and False to 1 and 0 respectively:
In [43]:
df['First Season'] = (df['First Season'] > 1990).astype(int)
df
Out[43]:
                 Team  First Season  Total Games
0      Dallas Cowboys             0          894
1       Chicago Bears             0         1357
2   Green Bay Packers             0         1339
3      Miami Dolphins             0          792
4    Baltimore Ravens             1          326
5  San Franciso 49ers             0         1003
Answer 1
A bit late to the party, but still: I prefer using numpy.where:
import numpy as np
df['First Season'] = np.where(df['First Season'] > 1990, 1, df['First Season'])
Strange that nobody has given this answer: the only missing part of your code is the ['First Season'] right after df; just remove the brackets inside.
Answer 3
For a single condition, i.e. (df['employrate'] > 70):
       country        employrate alcconsumption
0  Afghanistan  55.7000007629394            .03
1      Albania  51.4000015258789           7.29
2      Algeria              50.5            .69
3      Andorra                            10.17
4       Angola  75.6999969482422           5.57
Use this:
df.loc[df['employrate'] > 70, 'employrate'] = 7
       country  employrate alcconsumption
0  Afghanistan   55.700001            .03
1      Albania   51.400002           7.29
2      Algeria   50.500000            .69
3      Andorra         nan          10.17
4       Angola    7.000000           5.57
So the syntax here is:
df.loc[<mask>(here mask is generating the labels to index), <optional column(s)>]
df.loc takes two arguments, a row index and a column index. We check whether each row's value in the 'First Season' column is greater than 27 and, if so, replace it with 1.
So I’ve followed this tutorial but it doesn’t seem to do anything. Simply nothing. It waits a few seconds and closes the program. What is wrong with this code?
import cv2
vidcap = cv2.VideoCapture('Compton.mp4')
success,image = vidcap.read()
count = 0
success = True
while success:
    success,image = vidcap.read()
    cv2.imwrite("frame%d.jpg" % count, image)     # save frame as JPEG file
    if cv2.waitKey(10) == 27:                     # exit if Escape is hit
        break
    count += 1
Also, in the comments it says that this limits the frames to 1000? Why?
EDIT:
I tried doing success = True first but that didn’t help. It only created one image that was 0 bytes.
From here download this video so we have the same video file for the test. Make sure to have that mp4 file in the same directory of your python code. Then also make sure to run the python interpreter from the same directory.
Then modify the code: drop waitKey, which only wastes time here, since without a window it cannot capture keyboard events. We also print the success value to confirm that frames are being read successfully.
import cv2
vidcap = cv2.VideoCapture('big_buck_bunny_720p_5mb.mp4')
success,image = vidcap.read()
count = 0
while success:
    cv2.imwrite("frame%d.jpg" % count, image)     # save frame as JPEG file
    success,image = vidcap.read()
    print('Read a new frame: ', success)
    count += 1
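The loop above reads the next frame at the end of each iteration, so imwrite is only ever called with a frame that was read successfully. A minimal sketch of that ordering, separated from OpenCV so it can be run without a video file: save_all_frames, write_frame, and FakeCapture are hypothetical names introduced for illustration, not part of cv2.

```python
def save_all_frames(capture, write_frame):
    """Read frames until read() reports failure; return how many were written.

    `capture` is anything with a read() -> (success, image) method, such as a
    cv2.VideoCapture. `write_frame(index, image)` is a hypothetical callback.
    """
    count = 0
    success, image = capture.read()
    while success:
        write_frame(count, image)        # only reached for a valid frame
        success, image = capture.read()  # fetch the next frame last
        count += 1
    return count


class FakeCapture:
    """Stand-in for cv2.VideoCapture that yields a fixed list of frames."""

    def __init__(self, frames):
        self._frames = list(frames)

    def read(self):
        if self._frames:
            return True, self._frames.pop(0)
        return False, None


written = []
total = save_all_frames(FakeCapture(["f0", "f1", "f2"]),
                        lambda index, image: written.append((index, image)))
print(total, written)
```

With OpenCV installed, the same function could presumably be called as save_all_frames(cv2.VideoCapture(path), lambda i, img: cv2.imwrite("frame%d.jpg" % i, img)).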
To extend this question (and the answer by @user2700065) to a slightly different case: if you do not want to extract every frame but only one frame per second, a 1-minute video will give 60 frames (images).
import sys
import argparse
import cv2
print(cv2.__version__)

def extractImages(pathIn, pathOut):
    count = 0
    vidcap = cv2.VideoCapture(pathIn)
    success,image = vidcap.read()
    success = True
    while success:
        vidcap.set(cv2.CAP_PROP_POS_MSEC,(count*1000))    # added this line
        success,image = vidcap.read()
        print ('Read a new frame: ', success)
        cv2.imwrite( pathOut + "\\frame%d.jpg" % count, image)     # save frame as JPEG file
        count = count + 1

if __name__=="__main__":
    a = argparse.ArgumentParser()
    a.add_argument("--pathIn", help="path to video")
    a.add_argument("--pathOut", help="path to images")
    args = a.parse_args()
    print(args)
    extractImages(args.pathIn, args.pathOut)
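The key step above is seeking with CAP_PROP_POS_MSEC to count*1000 milliseconds before each read. As a rough sketch of which offsets that produces, the hypothetical helper below (sample_positions_ms is not an OpenCV function) lists the millisecond positions for a video of known duration.

```python
def sample_positions_ms(duration_s, step_s=1.0):
    """Return the millisecond offsets to seek to, one every `step_s` seconds."""
    if step_s <= 0:
        raise ValueError("step_s must be positive")
    positions = []
    index = 0
    # Stop before the end of the video: there is no frame at the final instant.
    while index * step_s < duration_s:
        positions.append(int(round(index * step_s * 1000)))
        index += 1
    return positions


print(sample_positions_ms(3))
print(sample_positions_ms(2, step_s=0.5))
print(len(sample_positions_ms(60)))
```

A 60-second video yields 60 positions, which matches the "1-minute video will give 60 frames" claim above.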
Answer 2
This is a tweak of the previous answer by @GShocked for Python 3.x. I would have posted it as a comment, but I don't have enough reputation.
import sys
import argparse
import cv2
print(cv2.__version__)

def extractImages(pathIn, pathOut):
    vidcap = cv2.VideoCapture(pathIn)
    success,image = vidcap.read()
    count = 0
    success = True
    while success:
        success,image = vidcap.read()
        print ('Read a new frame: ', success)
        cv2.imwrite( pathOut + "\\frame%d.jpg" % count, image)     # save frame as JPEG file
        count += 1

if __name__=="__main__":
    print("aba")
    a = argparse.ArgumentParser()
    a.add_argument("--pathIn", help="path to video")
    a.add_argument("--pathOut", help="path to images")
    args = a.parse_args()
    print(args)
    extractImages(args.pathIn, args.pathOut)
Answer 3
This function converts most video formats into the individual frames the video contains. It works on Python 3 with OpenCV 3+.
import cv2
import time
import os

def video_to_frames(input_loc, output_loc):
    """Function to extract frames from input video file
    and save them as separate frames in an output directory.
    Args:
        input_loc: Input video file.
        output_loc: Output directory to save the frames.
    Returns:
        None
    """
    try:
        os.mkdir(output_loc)
    except OSError:
        pass
    # Log the time
    time_start = time.time()
    # Start capturing the feed
    cap = cv2.VideoCapture(input_loc)
    # Find the number of frames
    video_length = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) - 1
    print ("Number of frames: ", video_length)
    count = 0
    print ("Converting video..\n")
    # Start converting the video
    while cap.isOpened():
        # Extract the frame
        ret, frame = cap.read()
        # Write the results back to output location.
        cv2.imwrite(output_loc + "/%#05d.jpg" % (count+1), frame)
        count = count + 1
        # If there are no more frames left
        if (count > (video_length-1)):
            # Log the time again
            time_end = time.time()
            # Release the feed
            cap.release()
            # Print stats
            print ("Done extracting frames.\n%d frames extracted" % count)
            print ("It took %d seconds for conversion." % (time_end-time_start))
            break

if __name__=="__main__":
    input_loc = '/path/to/video/00009.MTS'
    output_loc = '/path/to/output/frames/'
    video_to_frames(input_loc, output_loc)
It supports .mts and normal files like .mp4 and .avi. Tried and Tested on .mts files. Works like a Charm.
After a lot of research on how to convert frames to video, I created this function; I hope it helps. It requires OpenCV:
import cv2
import numpy as np
import os
from os.path import isfile, join   # needed by the file listing below

def frames_to_video(inputpath, outputpath, fps):
    image_array = []
    files = [f for f in os.listdir(inputpath) if isfile(join(inputpath, f))]
    # sort numerically; assumes names like "frame12.jpg" (5-character prefix, 4-character extension)
    files.sort(key = lambda x: int(x[5:-4]))
    for i in range(len(files)):
        # join() adds the path separator if inputpath lacks a trailing one
        img = cv2.imread(join(inputpath, files[i]))
        size = (img.shape[1], img.shape[0])
        img = cv2.resize(img, size)
        image_array.append(img)
    fourcc = cv2.VideoWriter_fourcc('D', 'I', 'V', 'X')
    out = cv2.VideoWriter(outputpath, fourcc, fps, size)
    for i in range(len(image_array)):
        out.write(image_array[i])
    out.release()

inputpath = 'folder path'
outpath = 'video file path/video.mp4'
fps = 29
frames_to_video(inputpath, outpath, fps)
Change the value of fps (frames per second), the input folder path, and the output file path to match your own setup.
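The sort key in the function above, int(x[5:-4]), matters because a plain string sort puts frame10.jpg before frame2.jpg, which would scramble the video. A small sketch of the difference, using a hypothetical frame_number helper that pulls the digits out with a regular expression instead of fixed slice positions:

```python
import re


def frame_number(filename):
    """Return the first run of digits in a file name as an int."""
    match = re.search(r"\d+", filename)
    if match is None:
        raise ValueError("no frame number in %r" % filename)
    return int(match.group())


names = ["frame10.jpg", "frame2.jpg", "frame0.jpg", "frame1.jpg"]
print(sorted(names))                    # plain string sort
print(sorted(names, key=frame_number))  # numeric sort
```

The regular expression is a design choice for this sketch: it tolerates prefixes other than "frame", whereas the slice in the answer assumes exactly that naming scheme.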
Answer 5
The previous answers lose the first frame. It is also nicer to store the images in a folder.
# create a folder to store extracted images
import os
folder = 'test'
os.mkdir(folder)
# use opencv to do the job
import cv2
print(cv2.__version__)  # my version is 3.1.0
vidcap = cv2.VideoCapture('test_video.mp4')
count = 0
while True:
    success,image = vidcap.read()
    if not success:
        break
    cv2.imwrite(os.path.join(folder,"frame{:d}.jpg".format(count)), image)     # save frame as JPEG file
    count += 1
print("{} images are extracted in {}.".format(count,folder))
By the way, you can check the frame rate with VLC: go to Window -> Media Information -> Codec Details.
Answer 6
This code extracts frames from the video and saves them in .jpg format.
import cv2
import numpy as np
import os

# set video file path of input video with name and extension
vid = cv2.VideoCapture('VideoPath')

if not os.path.exists('images'):
    os.makedirs('images')

#for frame identity
index = 0
while(True):
    # Extract images
    ret, frame = vid.read()
    # end of frames
    if not ret:
        break
    # Saves images
    name = './images/frame' + str(index) + '.jpg'
    print ('Creating...' + name)
    cv2.imwrite(name, frame)
    # next frame
    index += 1
I am using Python via Anaconda’s Spyder software. Using the original code listed in the question of this thread by @Gshocked, the code does not work (the python won’t read the mp4 file). So I downloaded OpenCV 3.2 and copied “opencv_ffmpeg320.dll” and “opencv_ffmpeg320_64.dll” from the “bin” folder. I pasted both of these dll files to Anaconda’s “Dlls” folder.
Anaconda also has a “pckgs” folder…I copied and pasted the entire “OpenCV 3.2” folder that I downloaded to the Anaconda “pckgs” folder.
Finally, Anaconda has a “Library” folder which has a “bin” subfolder. I pasted the “opencv_ffmpeg320.dll” and “opencv_ffmpeg320_64.dll” files to that folder.
After closing and restarting Spyder, the code worked. I’m not sure which of the three methods worked, and I’m too lazy to go back and figure it out. But it works so, cheers!
Answer 8
This function extracts images from a video at 1 fps; in addition, it identifies the last frame and stops reading:
import cv2
import numpy as np

def extract_image_one_fps(video_source_path):
    vidcap = cv2.VideoCapture(video_source_path)
    count = 0
    success = True
    while success:
        vidcap.set(cv2.CAP_PROP_POS_MSEC,(count*1000))
        success,image = vidcap.read()
        ## Stop when last frame is identified
        image_last = cv2.imread("frame{}.png".format(count-1))
        if np.array_equal(image,image_last):
            break
        cv2.imwrite("frame%d.png" % count, image)     # save frame as PNG file
        print('{}.sec reading a new frame: {} '.format(count,success))
        count += 1
Following script will extract frames every half a second of all videos in folder. (Works on python 3.7)
import cv2
import os

listing = os.listdir(r'D:/Images/AllVideos')
count=1
for vid in listing:
    vid = r"D:/Images/AllVideos/"+vid
    vidcap = cv2.VideoCapture(vid)
    def getFrame(sec):
        vidcap.set(cv2.CAP_PROP_POS_MSEC,sec*1000)
        hasFrames,image = vidcap.read()
        if hasFrames:
            cv2.imwrite("D:/Images/Frames/image"+str(count)+".jpg", image) # Save frame as JPG file
        return hasFrames
    sec = 0
    frameRate = 0.5 # Change this number to 1 for each 1 second
    success = getFrame(sec)
    while success:
        count = count + 1
        sec = sec + frameRate
        sec = round(sec, 2)
        success = getFrame(sec)
I want to calculate the column wise mean of a data frame.
This is easy:
df.apply(average)
then the column wise range max(col) – min(col). This is easy again:
df.apply(max) - df.apply(min)
Now for each element I want to subtract its column’s mean and divide by its column’s range. I am not sure how to do that
Any help/pointers are much appreciated.
Answer 0
In [92]: df
Out[92]:
a b c d
A -0.488816 0.863769 4.325608 -4.721202
B -11.937097 2.993993 -12.916784 -1.086236
C -5.569493 4.672679 -2.168464 -9.315900
D 8.892368 0.932785 4.535396 0.598124
In [93]: df_norm = (df - df.mean()) / (df.max() - df.min())
In [94]: df_norm
Out[94]:
a b c d
A 0.085789 -0.394348 0.337016 -0.109935
B -0.463830 0.164926 -0.650963 0.256714
C -0.158129 0.605652 -0.035090 -0.573389
D 0.536170 -0.376229 0.349037 0.426611
In [95]: df_norm.mean()
Out[95]:
a -2.081668e-17
b 4.857226e-17
c 1.734723e-17
d -1.040834e-17
In [96]: df_norm.max() - df_norm.min()
Out[96]:
a 1
b 1
c 1
d 1
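The session above relies on pandas aligning df.mean(), df.max() and df.min() (each a Series indexed by column name) with the frame's columns, so the arithmetic is applied column by column. A minimal sketch on a small hypothetical frame, chosen so the results are easy to check by hand:

```python
import pandas as pd

# Hypothetical data: column "a" has mean 3 and range 5, column "b" has mean 20 and range 20.
df = pd.DataFrame({"a": [1.0, 2.0, 6.0], "b": [10.0, 30.0, 20.0]})

# Subtract each column's mean and divide by each column's range.
df_norm = (df - df.mean()) / (df.max() - df.min())

# After normalization every column should have mean ~0 and range 1.
col_range = df_norm.max() - df_norm.min()

print(df_norm)
print(col_range)
```

The first entry of column "a" is (1 - 3) / 5 = -0.4, which can be used as a quick sanity check.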
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randn(4,4)*4+3)
          0         1         2         3
0  9.497381  0.552974  0.887313 -1.291874
1  6.461631 -6.206155  9.979247 -0.044828
2  4.276156  2.002518  8.848432 -5.240563
3  1.710331  1.463783  7.535078 -1.399565
df.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
          0         1         2         3
0  0.515087  0.133967 -0.651699  0.135175
1  0.125241 -0.689446  0.348301  0.375188
2 -0.155414  0.310554  0.223925 -0.624812
3 -0.484913  0.244924  0.079473  0.114448
Also, it works nicely with groupby if you select the relevant columns:
df['grp'] = ['A', 'A', 'B', 'B']
          0         1         2         3 grp
0  9.497381  0.552974  0.887313 -1.291874   A
1  6.461631 -6.206155  9.979247 -0.044828   A
2  4.276156  2.002518  8.848432 -5.240563   B
3  1.710331  1.463783  7.535078 -1.399565   B
df.groupby(['grp'])[[0,1,2,3]].apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
     0    1    2    3
0  0.5  0.5 -0.5 -0.5
1 -0.5 -0.5  0.5  0.5
2  0.5  0.5  0.5 -0.5
3 -0.5 -0.5 -0.5  0.5
I wanted customized normalization, because a regular percentile of the datum or a z-score was not adequate. Sometimes I knew what the feasible max and min of the population were, and therefore wanted to define them differently from my sample, or use a different midpoint, or whatever! This can often be useful for rescaling and normalizing data for neural nets, where you may want all inputs between 0 and 1 but some of your data may need to be scaled in a more customized way, because percentiles and stdevs assume your sample covers the population, and sometimes we know this isn't true. It was also very useful for me when visualizing data in heatmaps. So I built a custom function (with extra steps in the code here to make it as readable as possible):
from numpy import mean, median  # assumed source of mean/median, which the original does not import

def NormData(s,low='min',center='mid',hi='max',insideout=False,shrinkfactor=0.):
    if low=='min':
        low=min(s)
    elif low=='abs':
        low=max(abs(min(s)),abs(max(s)))*-1.#sign(min(s))
    if hi=='max':
        hi=max(s)
    elif hi=='abs':
        hi=max(abs(min(s)),abs(max(s)))*1.#sign(max(s))

    if center=='mid':
        center=(max(s)+min(s))/2
    elif center=='avg':
        center=mean(s)
    elif center=='median':
        center=median(s)

    # shift everything so that the center sits at 0
    s2=[x-center for x in s]
    hi=hi-center
    low=low-center
    center=0.

    r=[]

    for x in s2:
        if x<low:
            r.append(0.)
        elif x>hi:
            r.append(1.)
        else:
            if x>=center:
                r.append((x-center)/(hi-center)*0.5+0.5)
            else:
                r.append((x-low)/(center-low)*0.5+0.)

    if insideout==True:
        ir=[(1.-abs(z-0.5)*2.) for z in r]
        r=ir

    rr =[x-(x-0.5)*shrinkfactor for x in r]
    return rr
This takes a pandas Series, or even just a list, and normalizes it to your specified low, center, and high points. There is also a shrink factor, which lets you scale the data away from the endpoints 0 and 1 (I had to do this when combining colormaps in matplotlib; see "Single pcolormesh with more than one colormap using Matplotlib"). You can likely see how the code works, but say you have values [-5,1,10] in a sample and want to normalize based on a range of -7 to 7 (so anything above 7, such as our 10, is effectively treated as a 7) with a midpoint of 2, shrunk to fit a 256 RGB colormap:
It can also turn your data inside out… this may seem odd, but I found it useful for heatmapping. Say you want a darker color for values closer to 0 rather than hi/low. You could heatmap based on normalized data where insideout=True:
Previously, I asked How to get data received in Flask request because request.data was empty. The answer explained that request.data is the raw post body, but will be empty if form data is parsed. How can I get the raw post body unconditionally?
@app.route('/', methods=['POST'])
def parse_request():
    data = request.data  # empty in some cases
    # always need raw data here, not parsed form data
Use request.get_data() to get the raw data, regardless of content type. The data is cached and you can subsequently access request.data, request.json, request.form at will.
If you access request.data first, it will call get_data with an argument to parse form data first. If the request has a form content type (multipart/form-data, application/x-www-form-urlencoded, or application/x-url-encoded) then the raw data will be consumed. request.data and request.json will appear empty in this case.
request.stream is the stream of raw data passed to the application by the WSGI server. No parsing is done when reading it, although you usually want request.get_data() instead.
data = request.stream.read()
The stream will be empty if it was previously read by request.data or another attribute.
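As a rough illustration of the request.get_data() behavior described above, the sketch below posts a form to a small app through Flask's test client. The route, field name, and response shape are assumptions made for this example, not part of the original question.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/", methods=["POST"])
def parse_request():
    raw = request.get_data()          # raw bytes, cached on the request
    field = request.form.get("name")  # form parsing still works afterwards
    return jsonify(raw=raw.decode("utf-8"), name=field)


# Exercise the route with Flask's built-in test client (no server needed).
with app.test_client() as client:
    response = client.post("/", data={"name": "flask"})
    payload = response.get_json()

print(payload)
```

Calling get_data() before touching request.form is what keeps both views of the body available here.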
I created a WSGI middleware that stores the raw body from the environ['wsgi.input'] stream. I saved the value in the WSGI environ so I could access it from request.environ['body_copy'] within my app.
This isn’t necessary in Werkzeug or Flask, as request.get_data() will get the raw data regardless of content type, but with better handling of HTTP and WSGI behavior.
This reads the entire body into memory, which will be an issue if for example a large file is posted. This won’t read anything if the Content-Length header is missing, so it won’t handle streaming requests.
from io import BytesIO

class WSGICopyBody(object):
    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        length = int(environ.get('CONTENT_LENGTH') or 0)
        body = environ['wsgi.input'].read(length)
        environ['body_copy'] = body
        # replace the stream since it was exhausted by read()
        environ['wsgi.input'] = BytesIO(body)
        return self.application(environ, start_response)

app.wsgi_app = WSGICopyBody(app.wsgi_app)
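To see what the middleware does without running a server, it can be called directly with a hand-built environ. The sketch below repeats the class so that it is self-contained; echo_app and the sample body are hypothetical, introduced only to show that the wrapped application can still read the stream while the copy stays available.

```python
from io import BytesIO


class WSGICopyBody:
    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(length)
        environ["body_copy"] = body
        # replace the stream since it was exhausted by read()
        environ["wsgi.input"] = BytesIO(body)
        return self.application(environ, start_response)


def echo_app(environ, start_response):
    """Hypothetical app: consumes the stream, then returns it with the saved copy."""
    consumed = environ["wsgi.input"].read()
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return [consumed + b"|" + environ["body_copy"]]


body = b"a=1&b=2"
environ = {
    "REQUEST_METHOD": "POST",
    "CONTENT_LENGTH": str(len(body)),
    "wsgi.input": BytesIO(body),
}
statuses = []
result = WSGICopyBody(echo_app)(environ, lambda status, headers: statuses.append(status))
print(statuses, result)
```

Both halves of the response are identical, which shows the stream was replaced after being read and the copy survived.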
request.data will be empty if request.headers["Content-Type"] is recognized as form data, which will be parsed into request.form. To get the raw data regardless of content type, use request.get_data().
request.data calls request.get_data(parse_form_data=True), which results in the different behavior for form data.