Python 实用宝典

Question 1

I want to be able to list only the directories inside some folder. This means I don’t want filenames listed, nor do I want additional sub-folders.

Let’s see if an example helps. In the current directory we have:

>>> os.listdir(os.getcwd())
['cx_Oracle-doc', 'DLLs', 'Doc', 'include', 'Lib', 'libs', 'LICENSE.txt', 'mod_p
ython-wininst.log', 'NEWS.txt', 'pymssql-wininst.log', 'python.exe', 'pythonw.ex
e', 'README.txt', 'Removemod_python.exe', 'Removepymssql.exe', 'Scripts', 'tcl',
 'Tools', 'w9xpopen.exe']

However, I don’t want filenames listed. Nor do I want sub-folders such as \Lib\curses. Essentially what I want works with the following:

>>> for root, dirnames, filenames in os.walk('.'):
...     print dirnames
...     break
...
['cx_Oracle-doc', 'DLLs', 'Doc', 'include', 'Lib', 'libs', 'Scripts', 'tcl', 'Tools']

However, I’m wondering if there’s a simpler way of achieving the same results. I get the impression that using os.walk only to return the top level is inefficient/too much.

Question 2

Filter the result using os.path.isdir() (and use os.path.join() to get the real path):

>>> [ name for name in os.listdir(thedir) if os.path.isdir(os.path.join(thedir, name)) ]
['ctypes', 'distutils', 'encodings', 'lib-tk', 'config', 'idlelib', 'xml', 'bsddb', 'hotshot', 'logging', 'doc', 'test', 'compiler', 'curses', 'site-packages', 'email', 'sqlite3', 'lib-dynload', 'wsgiref', 'plat-linux2', 'plat-mac']

Question 3

os.walk

Use os.walk with next item function:

next(os.walk('.'))[1]

For Python <=2.5 use:

os.walk('.').next()[1]

How this works

os.walk is a generator and calling next will get the first result in the form of a 3-tuple (dirpath, dirnames, filenames). Thus the [1] index returns only the dirnames from that tuple.

Question 4

Filter the list using os.path.isdir to detect directories.

filter(os.path.isdir, os.listdir(os.getcwd()))

Question 5

directories=[d for d in os.listdir(os.getcwd()) if os.path.isdir(d)]

Question 6

Note that, instead of doing os.listdir(os.getcwd()), it’s preferable to do os.listdir(os.path.curdir). One less function call, and it’s as portable.

So, to complete the answer, to get a list of directories in a folder:

def listdirs(folder):
    return [d for d in os.listdir(folder) if os.path.isdir(os.path.join(folder, d))]

If you prefer full pathnames, then use this function:

def listdirs(folder):
    return [
        d for d in (os.path.join(folder, d1) for d1 in os.listdir(folder))
        if os.path.isdir(d)
    ]

Question 7

This seems to work too (at least on linux):

import glob, os
glob.glob('*' + os.path.sep)

Question 8

Just to add that using os.listdir() does not “take a lot of processing vs very simple os.walk().next()[1]”. This is because os.walk() uses os.listdir() internally. In fact if you test them together:

>>>> import timeit
>>>> timeit.timeit("os.walk('.').next()[1]", "import os", number=10000)
1.1215229034423828
>>>> timeit.timeit("[ name for name in os.listdir('.') if os.path.isdir(os.path.join('.', name)) ]", "import os", number=10000)
1.0592019557952881

The filtering of os.listdir() is very slightly faster.

Question 9

A very much simpler and elegant way is to use this:

 import os
 dir_list = os.walk('.').next()[1]
 print dir_list

Run this script in the same folder for which you want folder names.It will give you exactly the immediate folders name only(that too without the full path of the folders).

Question 10

Using list comprehension,

[a for a in os.listdir() if os.path.isdir(a)]

I think It is the simplest way

Question 11

being a newbie here i can’t yet directly comment but here is a small correction i’d like to add to the following part of ΤΖΩΤΖΙΟΥ’s answer :

If you prefer full pathnames, then use this function:

def listdirs(folder):  
  return [
    d for d in (os.path.join(folder, d1) for d1 in os.listdir(folder))
    if os.path.isdir(d)
]

for those still on python < 2.4: the inner construct needs to be a list instead of a tuple and therefore should read like this:

def listdirs(folder):  
  return [
    d for d in [os.path.join(folder, d1) for d1 in os.listdir(folder)]
    if os.path.isdir(d)
  ]

otherwise one gets a syntax error.

Question 12

[x for x in os.listdir(somedir) if os.path.isdir(os.path.join(somedir, x))]

Question 13

For a list of full path names I prefer this version to the other solutions here:

def listdirs(dir):
    return [os.path.join(os.path.join(dir, x)) for x in os.listdir(dir) 
        if os.path.isdir(os.path.join(dir, x))]

Question 14

scanDir = "abc"
directories = [d for d in os.listdir(scanDir) if os.path.isdir(os.path.join(os.path.abspath(scanDir), d))]

Question 15

A safer option that does not fail when there is no directory.

def listdirs(folder):
    if os.path.exists(folder):
         return [d for d in os.listdir(folder) if os.path.isdir(os.path.join(folder, d))]
    else:
         return []

Question 16

Like so?

>>>> [path for path in os.listdir(os.getcwd()) if os.path.isdir(path)]

Question 17

Python 3.4 introduced the pathlib module into the standard library, which provides an object oriented approach to handle filesystem paths:

from pathlib import Path

p = Path('./')
[f for f in p.iterdir() if f.is_dir()]

Question 18

-- This will exclude files and traverse through 1 level of sub folders in the root

def list_files(dir):
    List = []
    filterstr = ' '
    for root, dirs, files in os.walk(dir, topdown = True):
        #r.append(root)
        if (root == dir):
            pass
        elif filterstr in root:
            #filterstr = ' '
            pass
        else:
            filterstr = root
            #print(root)
            for name in files:
                print(root)
                print(dirs)
                List.append(os.path.join(root,name))
            #print(os.path.join(root,name),"\n")
                print(List,"\n")

    return List

Question 19

Suppose python code is executed in not known by prior windows directory say ‘main’ , and wherever code is installed when it runs it needs to access to directory ‘main/2091/data.txt’ .

how should I use open(location) function? what should be location ?

Edit :

I found that below simple code will work..does it have any disadvantages ?

    file="\2091\sample.txt"
    path=os.getcwd()+file
    fp=open(path,'r+');

Question 20

With this type of thing you need to be careful what your actual working directory is. For example, you may not run the script from the directory the file is in. In this case, you can’t just use a relative path by itself.

If you are sure the file you want is in a subdirectory beneath where the script is actually located, you can use __file__ to help you out here. __file__ is the full path to where the script you are running is located.

So you can fiddle with something like this:

import os
script_dir = os.path.dirname(__file__) #<-- absolute dir the script is in
rel_path = "2091/data.txt"
abs_file_path = os.path.join(script_dir, rel_path)

Question 21

This code works fine:

import os


def readFile(filename):
    filehandle = open(filename)
    print filehandle.read()
    filehandle.close()



fileDir = os.path.dirname(os.path.realpath('__file__'))
print fileDir

#For accessing the file in the same folder
filename = "same.txt"
readFile(filename)

#For accessing the file in a folder contained in the current folder
filename = os.path.join(fileDir, 'Folder1.1/same.txt')
readFile(filename)

#For accessing the file in the parent folder of the current folder
filename = os.path.join(fileDir, '../same.txt')
readFile(filename)

#For accessing the file inside a sibling folder.
filename = os.path.join(fileDir, '../Folder2/same.txt')
filename = os.path.abspath(os.path.realpath(filename))
print filename
readFile(filename)

Question 22

I created an account just so I could clarify a discrepancy I think I found in Russ’s original response.

For reference, his original answer was:

import os
script_dir = os.path.dirname(__file__)
rel_path = "2091/data.txt"
abs_file_path = os.path.join(script_dir, rel_path)

This is a great answer because it is trying to dynamically creates an absolute system path to the desired file.

Cory Mawhorter noticed that __file__ is a relative path (it is as well on my system) and suggested using os.path.abspath(__file__). os.path.abspath, however, returns the absolute path of your current script (i.e. /path/to/dir/foobar.py)

To use this method (and how I eventually got it working) you have to remove the script name from the end of the path:

import os
script_path = os.path.abspath(__file__) # i.e. /path/to/dir/foobar.py
script_dir = os.path.split(script_path)[0] #i.e. /path/to/dir/
rel_path = "2091/data.txt"
abs_file_path = os.path.join(script_dir, rel_path)

The resulting abs_file_path (in this example) becomes: /path/to/dir/2091/data.txt

Question 23

It depends on what operating system you’re using. If you want a solution that is compatible with both Windows and *nix something like:

from os import path

file_path = path.relpath("2091/data.txt")
with open(file_path) as f:
    <do stuff>

should work fine.

The path module is able to format a path for whatever operating system it’s running on. Also, python handles relative paths just fine, so long as you have correct permissions.

Edit:

As mentioned by kindall in the comments, python can convert between unix-style and windows-style paths anyway, so even simpler code will work:

with open("2091/data/txt") as f:
    <do stuff>

That being said, the path module still has some useful functions.

Question 24

I spend a lot time to discover why my code could not find my file running Python 3 on the Windows system. So I added . before / and everything worked fine:

import os

script_dir = os.path.dirname(__file__)
file_path = os.path.join(script_dir, './output03.txt')
print(file_path)
fptr = open(file_path, 'w')

Question 25

Code:

import os
script_path = os.path.abspath(__file__) 
path_list = script_path.split(os.sep)
script_directory = path_list[0:len(path_list)-1]
rel_path = "main/2091/data.txt"
path = "/".join(script_directory) + "/" + rel_path

Explanation:

Import library:

import os

Use __file__ to attain the current script’s path:

script_path = os.path.abspath(__file__)

Separates the script path into multiple items:

path_list = script_path.split(os.sep)

Remove the last item in the list (the actual script file):

script_directory = path_list[0:len(path_list)-1]

Add the relative file’s path:

rel_path = "main/2091/data.txt

Join the list items, and addition the relative path’s file:

path = "/".join(script_directory) + "/" + rel_path

Now you are set to do whatever you want with the file, such as, for example:

file = open(path)

Question 26

If the file is in your parent folder, eg. follower.txt, you can simply use open('../follower.txt', 'r').read()

Question 27

Try this:

from pathlib import Path

data_folder = Path("/relative/path")
file_to_open = data_folder / "file.pdf"

f = open(file_to_open)

print(f.read())

Python 3.4 introduced a new standard library for dealing with files and paths called pathlib. It works for me!

Question 28

Not sure if this work everywhere.

I’m using ipython in ubuntu.

If you want to read file in current folder’s sub-directory:

/current-folder/sub-directory/data.csv

your script is in current-folder simply try this:

import pandas as pd
path = './sub-directory/data.csv'
pd.read_csv(path)

Question 29

Python just passes the filename you give it to the operating system, which opens it. If your operating system supports relative paths like main/2091/data.txt (hint: it does), then that will work fine.

You may find that the easiest way to answer a question like this is to try it and see what happens.

Question 30

import os
def file_path(relative_path):
    dir = os.path.dirname(os.path.abspath(__file__))
    split_path = relative_path.split("/")
    new_path = os.path.join(dir, *split_path)
    return new_path

with open(file_path("2091/data.txt"), "w") as f:
    f.write("Powerful you have become.")

Question 31

When I was a beginner I found these descriptions a bit intimidating. As at first I would try For Windows

f= open('C:\Users\chidu\Desktop\Skipper New\Special_Note.txt','w+')
print(f)

and this would raise an syntax error. I used get confused alot. Then after some surfing across google. found why the error occurred. Writing this for beginners

It’s because for path to be read in Unicode you simple add a \ when starting file path

f= open('C:\\Users\chidu\Desktop\Skipper New\Special_Note.txt','w+')
print(f)

And now it works just add \ before starting the directory.

Question 32

numpy has three different functions which seem like they can be used for the same things — except that numpy.maximum can only be used element-wise, while numpy.max and numpy.amax can be used on particular axes, or all elements. Why is there more than just numpy.max? Is there some subtlety to this in performance?

(Similarly for min vs. amin vs. minimum)

Question 33

np.max is just an alias for np.amax. This function only works on a single input array and finds the value of maximum element in that entire array (returning a scalar). Alternatively, it takes an axis argument and will find the maximum value along an axis of the input array (returning a new array).

>>> a = np.array([[0, 1, 6],
                  [2, 4, 1]])
>>> np.max(a)
6
>>> np.max(a, axis=0) # max of each column
array([2, 4, 6])

The default behaviour of np.maximum is to take two arrays and compute their element-wise maximum. Here, ‘compatible’ means that one array can be broadcast to the other. For example:

>>> b = np.array([3, 6, 1])
>>> c = np.array([4, 2, 9])
>>> np.maximum(b, c)
array([4, 6, 9])

But np.maximum is also a universal function which means that it has other features and methods which come in useful when working with multidimensional arrays. For example you can compute the cumulative maximum over an array (or a particular axis of the array):

>>> d = np.array([2, 0, 3, -4, -2, 7, 9])
>>> np.maximum.accumulate(d)
array([2, 2, 3, 3, 3, 7, 9])

This is not possible with np.max.

You can make np.maximum imitate np.max to a certain extent when using np.maximum.reduce:

>>> np.maximum.reduce(d)
9
>>> np.max(d)
9

Basic testing suggests the two approaches are comparable in performance; and they should be, as np.max() actually calls np.maximum.reduce to do the computation.

Question 34

You’ve already stated why np.maximum is different – it returns an array that is the element-wise maximum between two arrays.

As for np.amax and np.max: they both call the same function – np.max is just an alias for np.amax, and they compute the maximum of all elements in an array, or along an axis of an array.

In [1]: import numpy as np

In [2]: np.amax
Out[2]: <function numpy.core.fromnumeric.amax>

In [3]: np.max
Out[3]: <function numpy.core.fromnumeric.amax>

Question 35

For completeness, in Numpy there are four maximum related functions. They fall into two different categories:

np.amax/np.max, np.nanmax: for single array order statistics
and np.maximum, np.fmax: for element-wise comparison of two arrays

I. For single array order statistics

NaNs propagator np.amax/np.max and its NaN ignorant counterpart np.nanmax.

np.max is just an alias of np.amax, so they are considered as one function.
```
>>> np.max.__name__
'amax'
>>> np.max is np.amax
True
```

np.max propagates NaNs while np.nanmax ignores NaNs.

>>> np.max([np.nan, 3.14, -1])
nan
>>> np.nanmax([np.nan, 3.14, -1])
3.14

II. For element-wise comparison of two arrays

NaNs propagator np.maximum and its NaNs ignorant counterpart np.fmax.

Both functions require two arrays as the first two positional args to compare with.

# x1 and x2 must be the same shape or can be broadcast
np.maximum(x1, x2, /, ...);
np.fmax(x1, x2, /, ...)

np.maximum propagates NaNs while np.fmax ignores NaNs.

>>> np.maximum([np.nan, 3.14, 0], [np.NINF, np.nan, 2.72])
array([ nan,  nan, 2.72])
>>> np.fmax([np.nan, 3.14, 0], [np.NINF, np.nan, 2.72])
array([-inf, 3.14, 2.72])

The element-wise functions are np.ufunc(Universal Function), which means they have some special properties that normal Numpy function don’t have.

>>> type(np.maximum)
<class 'numpy.ufunc'>
>>> type(np.fmax)
<class 'numpy.ufunc'>
>>> #---------------#
>>> type(np.max)
<class 'function'>
>>> type(np.nanmax)
<class 'function'>

And finally, the same rules apply to the four minimum related functions:

np.amin/np.min, np.nanmin;
and np.minimum, np.fmin.

Question 36

np.maximum not only compares elementwise but also compares array elementwise with single value

>>>np.maximum([23, 14, 16, 20, 25], 18)
array([23, 18, 18, 20, 25])

Question 37

In a iPython notebook, I have a while loop that listens to a Serial port and print the received data in real time.

What I want to achieve to only show the latest received data (i.e only one line showing the most recent data. no scrolling in the cell output area)

What I need(i think) is to clear the old cell output when I receives new data, and then prints the new data. I am wondering how can I clear old data programmatically ?

Question 38

You can use IPython.display.clear_output to clear the output of a cell.

from IPython.display import clear_output

for i in range(10):
    clear_output(wait=True)
    print("Hello World!")

At the end of this loop you will only see one Hello World!.

Without a code example it’s not easy to give you working code. Probably buffering the latest n events is a good strategy. Whenever the buffer changes you can clear the cell’s output and print the buffer again.

Question 39

And in case you come here, like I did, looking to do the same thing for plots in a Julia notebook in Jupyter, using Plots, you can use:

    IJulia.clear_output(true)

so for a kind of animated plot of multiple runs

    if nrun==1  
      display(plot(x,y))         # first plot
    else 
      IJulia.clear_output(true)  # clear the window (as above)
      display(plot!(x,y))        # plot! overlays the plot
    end

Without the clear_output call, all plots appear separately.

Question 40

You can use the IPython.display.clear_output to clear the output as mentioned in cel’s answer. I would add that for me the best solution was to use this combination of parameters to print without any “shakiness” of the notebook:

from IPython.display import clear_output

for i in range(10):
    clear_output(wait=True)
    print(i, flush=True)

Question 41

I have a machine learning classification problem with 80% categorical variables. Must I use one hot encoding if I want to use some classifier for the classification? Can i pass the data to a classifier without the encoding?

I am trying to do the following for feature selection:

I read the train file:

num_rows_to_read = 10000
train_small = pd.read_csv("../../dataset/train.csv",   nrows=num_rows_to_read)

I change the type of the categorical features to ‘category’:

non_categorial_features = ['orig_destination_distance',
                          'srch_adults_cnt',
                          'srch_children_cnt',
                          'srch_rm_cnt',
                          'cnt']

for categorical_feature in list(train_small.columns):
    if categorical_feature not in non_categorial_features:
        train_small[categorical_feature] = train_small[categorical_feature].astype('category')

I use one hot encoding:

train_small_with_dummies = pd.get_dummies(train_small, sparse=True)

The problem is that the 3’rd part often get stuck, although I am using a strong machine.

Thus, without the one hot encoding I can’t do any feature selection, for determining the importance of the features.

What do you recommend?

Question 42

Approach 1: You can use pandas’ pd.get_dummies.

Example 1:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
Out[]: 
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
3  1.0  0.0  0.0

Example 2:

The following will transform a given column into one hot. Use prefix to have multiple dummies.

import pandas as pd
        
df = pd.DataFrame({
          'A':['a','b','a'],
          'B':['b','a','c']
        })
df
Out[]: 
   A  B
0  a  b
1  b  a
2  a  c

# Get one hot encoding of columns B
one_hot = pd.get_dummies(df['B'])
# Drop column B as it is now encoded
df = df.drop('B',axis = 1)
# Join the encoded df
df = df.join(one_hot)
df  
Out[]: 
       A  a  b  c
    0  a  0  1  0
    1  b  1  0  0
    2  a  0  0  1

Approach 2: Use Scikit-learn

Using a OneHotEncoder has the advantage of being able to fit on some training data and then transform on some other data using the same instance. We also have handle_unknown to further control what the encoder does with unseen data.

Given a dataset with three features and four samples, we let the encoder find the maximum value per feature and transform the data to a binary one-hot encoding.

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])   
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
   handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9], dtype=int32)
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

Here is the link for this example: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Question 43

Much easier to use Pandas for basic one-hot encoding. If you’re looking for more options you can use scikit-learn.

For basic one-hot encoding with Pandas you pass your data frame into the get_dummies function.

For example, if I have a dataframe called imdb_movies:

…and I want to one-hot encode the Rated column, I do this:

pd.get_dummies(imdb_movies.Rated)

This returns a new dataframe with a column for every “level” of rating that exists, along with either a 1 or 0 specifying the presence of that rating for a given observation.

Usually, we want this to be part of the original dataframe. In this case, we attach our new dummy coded frame onto the original frame using “column-binding.

We can column-bind by using Pandas concat function:

rated_dummies = pd.get_dummies(imdb_movies.Rated)
pd.concat([imdb_movies, rated_dummies], axis=1)

We can now run an analysis on our full dataframe.

SIMPLE UTILITY FUNCTION

I would recommend making yourself a utility function to do this quickly:

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    return(res)

Usage:

encode_and_bind(imdb_movies, 'Rated')

Result:

Also, as per @pmalbu comment, if you would like the function to remove the original feature_to_encode then use this version:

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return(res)

You can encode multiple features at the same time as follows:

features_to_encode = ['feature_1', 'feature_2', 'feature_3',
                      'feature_4']
for feature in features_to_encode:
    res = encode_and_bind(train_set, feature)

Question 44

You can do it with numpy.eye and a using the array element selection mechanism:

import numpy as np
nb_classes = 6
data = [[2, 3, 4, 0]]

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]

The the return value of indices_to_one_hot(nb_classes, data) is now

array([[[ 0.,  0.,  1.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  1.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  1.,  0.],
        [ 1.,  0.,  0.,  0.,  0.,  0.]]])

The .reshape(-1) is there to make sure you have the right labels format (you might also have [[2], [3], [4], [0]]).

Question 45

Firstly, easiest way to one hot encode: use Sklearn.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Secondly, I don’t think using pandas to one hot encode is that simple (unconfirmed though)

Creating dummy variables in pandas for python

Lastly, is it necessary for you to one hot encode? One hot encoding exponentially increases the number of features, drastically increasing the run time of any classifier or anything else you are going to run. Especially when each categorical feature has many levels. Instead you can do dummy coding.

Using dummy encoding usually works well, for much less run time and complexity. A wise prof once told me, ‘Less is More’.

Here’s the code for my custom encoding function if you want.

from sklearn.preprocessing import LabelEncoder

#Auto encodes any dataframe column of type category or object.
def dummyEncode(df):
        columnsToEncode = list(df.select_dtypes(include=['category','object']))
        le = LabelEncoder()
        for feature in columnsToEncode:
            try:
                df[feature] = le.fit_transform(df[feature])
            except:
                print('Error encoding '+feature)
        return df

EDIT: Comparison to be clearer:

One-hot encoding: convert n levels to n-1 columns.

Index  Animal         Index  cat  mouse
  1     dog             1     0     0
  2     cat       -->   2     1     0
  3    mouse            3     0     1

You can see how this will explode your memory if you have many different types (or levels) in your categorical feature. Keep in mind, this is just ONE column.

Dummy Coding:

Index  Animal         Index  Animal
  1     dog             1      0   
  2     cat       -->   2      1 
  3    mouse            3      2

Convert to numerical representations instead. Greatly saves feature space, at the cost of a bit of accuracy.

Question 46

One hot encoding with pandas is very easy:

def one_hot(df, cols):
    """
    @param df pandas DataFrame
    @param cols a list of columns to encode 
    @return a DataFrame with one-hot encoding
    """
    for each in cols:
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df

EDIT:

Another way to one_hot using sklearn’s LabelBinarizer :

from sklearn.preprocessing import LabelBinarizer 
label_binarizer = LabelBinarizer()
label_binarizer.fit(all_your_labels_list) # need to be global or remembered to use it later

def one_hot_encode(x):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
    """
    return label_binarizer.transform(x)

Question 47

You can use numpy.eye function.

import numpy as np

def one_hot_encode(x, n_classes):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
     """
    return np.eye(n_classes)[x]

def main():
    list = [0,1,2,3,4,3,2,1,0]
    n_classes = 5
    one_hot_list = one_hot_encode(list, n_classes)
    print(one_hot_list)

if __name__ == "__main__":
    main()

Result

D:\Desktop>python test.py
[[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.]]

Question 48

pandas as has inbuilt function “get_dummies” to get one hot encoding of that particular column/s.

one line code for one-hot-encoding:

df=pd.concat([df,pd.get_dummies(df['column name'],prefix='column name')],axis=1).drop(['column name'],axis=1)

Question 49

Here is a solution using DictVectorizer and the Pandas DataFrame.to_dict('records') method.

>>> import pandas as pd
>>> X = pd.DataFrame({'income': [100000,110000,90000,30000,14000,50000],
                      'country':['US', 'CAN', 'US', 'CAN', 'MEX', 'US'],
                      'race':['White', 'Black', 'Latino', 'White', 'White', 'Black']
                     })

>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer()
>>> qualitative_features = ['country','race']
>>> X_qual = v.fit_transform(X[qualitative_features].to_dict('records'))
>>> v.vocabulary_
{'country=CAN': 0,
 'country=MEX': 1,
 'country=US': 2,
 'race=Black': 3,
 'race=Latino': 4,
 'race=White': 5}

>>> X_qual.toarray()
array([[ 0.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  1.,  0.,  0.]])

Question 50

One-hot encoding requires bit more than converting the values to indicator variables. Typically ML process requires you to apply this coding several times to validation or test data sets and applying the model you construct to real-time observed data. You should store the mapping (transform) that was used to construct the model. A good solution would use the DictVectorizer or LabelEncoder (followed by get_dummies. Here is a function that you can use:

def oneHotEncode2(df, le_dict = {}):
    if not le_dict:
        columnsToEncode = list(df.select_dtypes(include=['category','object']))
        train = True;
    else:
        columnsToEncode = le_dict.keys()   
        train = False;

    for feature in columnsToEncode:
        if train:
            le_dict[feature] = LabelEncoder()
        try:
            if train:
                df[feature] = le_dict[feature].fit_transform(df[feature])
            else:
                df[feature] = le_dict[feature].transform(df[feature])

            df = pd.concat([df, 
                              pd.get_dummies(df[feature]).rename(columns=lambda x: feature + '_' + str(x))], axis=1)
            df = df.drop(feature, axis=1)
        except:
            print('Error encoding '+feature)
            #df[feature]  = df[feature].convert_objects(convert_numeric='force')
            df[feature]  = df[feature].apply(pd.to_numeric, errors='coerce')
    return (df, le_dict)

This works on a pandas dataframe and for each column of the dataframe it creates and returns a mapping back. So you would call it like this:

train_data, le_dict = oneHotEncode2(train_data)

Then on the test data, the call is made by passing the dictionary returned back from training:

test_data, _ = oneHotEncode2(test_data, le_dict)

An equivalent method is to use DictVectorizer. A related post on the same is on my blog. I mention it here since it provides some reasoning behind this approach over simply using get_dummies post (disclosure: this is my own blog).

Question 51

You can pass the data to catboost classifier without encoding. Catboost handles categorical variables itself by performing one-hot and target expanding mean encoding.

Question 52

You can do the following as well. Note for the below you don’t have to use pd.concat.

import pandas as pd 
# intialise data of lists. 
data = {'Color':['Red', 'Yellow', 'Red', 'Yellow'], 'Length':[20.1, 21.1, 19.1, 18.1],
       'Group':[1,2,1,2]} 

# Create DataFrame 
df = pd.DataFrame(data) 

for _c in df.select_dtypes(include=['object']).columns:
    print(_c)
    df[_c]  = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed

You can also change explicit columns to categorical. For example, here I am changing the Color and Group

import pandas as pd 
# intialise data of lists. 
data = {'Color':['Red', 'Yellow', 'Red', 'Yellow'], 'Length':[20.1, 21.1, 19.1, 18.1],
       'Group':[1,2,1,2]} 

# Create DataFrame 
df = pd.DataFrame(data) 
columns_to_change = list(df.select_dtypes(include=['object']).columns)
columns_to_change.append('Group')
for _c in columns_to_change:
    print(_c)
    df[_c]  = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed

Question 53

I know I’m late to this party, but the simplest way to hot encode a dataframe in an automated way is to use this function:

def hot_encode(df):
    obj_df = df.select_dtypes(include=['object'])
    return pd.get_dummies(df, columns=obj_df.columns).values

Question 54

I used this in my acoustic model: probably this helps in ur model.

def one_hot_encoding(x, n_out):
    x = x.astype(int)  
    shape = x.shape
    x = x.flatten()
    N = len(x)
    x_categ = np.zeros((N,n_out))
    x_categ[np.arange(N), x] = 1
    return x_categ.reshape((shape)+(n_out,))

Question 55

To add to other questions, let me provide how I did it with a Python 2.0 function using Numpy:

def one_hot(y_):
    # Function to encode output labels from number indexes 
    # e.g.: [[5], [0], [3]] --> [[0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]

    y_ = y_.reshape(len(y_))
    n_values = np.max(y_) + 1
    return np.eye(n_values)[np.array(y_, dtype=np.int32)]  # Returns FLOATS

The line n_values = np.max(y_) + 1 could be hard-coded for you to use the good number of neurons in case you use mini-batches for example.

Demo project/tutorial where this function has been used: https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition

Question 56

This works for me:

pandas.factorize( ['B', 'C', 'D', 'B'] )[0]

Output:

[0, 1, 2, 0]

Question 57

It can and it should be easy as :

class OneHotEncoder:
    def __init__(self,optionKeys):
        length=len(optionKeys)
        self.__dict__={optionKeys[j]:[0 if i!=j else 1 for i in range(length)] for j in range(length)}

Usage :

ohe=OneHotEncoder(["A","B","C","D"])
print(ohe.A)
print(ohe.D)

Question 58

Expanding @Martin Thoma’s answer

def one_hot_encode(y):
    """Convert an iterable of indices to one-hot encoded labels."""
    y = y.flatten() # Sometimes not flattened vector is passed e.g (118,1) in these cases
    # the function ends up creating a tensor e.g. (118, 2, 1). flatten removes this issue
    nb_classes = len(np.unique(y)) # get the number of unique classes
    standardised_labels = dict(zip(np.unique(y), np.arange(nb_classes))) # get the class labels as a dictionary
    # which then is standardised. E.g imagine class labels are (4,7,9) if a vector of y containing 4,7 and 9 is
    # directly passed then np.eye(nb_classes)[4] or 7,9 throws an out of index error.
    # standardised labels fixes this issue by returning a dictionary;
    # standardised_labels = {4:0, 7:1, 9:2}. The values of the dictionary are mapped to keys in y array.
    # standardised_labels also removes the error that is raised if the labels are floats. E.g. 1.0; element
    # cannot be called by an integer index e.g y[1.0] - throws an index error.
    targets = np.vectorize(standardised_labels.get)(y) # map the dictionary values to array.
    return np.eye(nb_classes)[targets]

Question 59

Short Answer

Here is a function to do one-hot-encoding without using numpy, pandas, or other packages. It takes a list of integers, booleans, or strings (and perhaps other types too).

import typing


def one_hot_encode(items: list) -> typing.List[list]:
    results = []
    # find the unique items (we want to unique items b/c duplicate items will have the same encoding)
    unique_items = list(set(items))
    # sort the unique items
    sorted_items = sorted(unique_items)
    # find how long the list of each item should be
    max_index = len(unique_items)

    for item in items:
        # create a list of zeros the appropriate length
        one_hot_encoded_result = [0 for i in range(0, max_index)]
        # find the index of the item
        one_hot_index = sorted_items.index(item)
        # change the zero at the index from the previous line to a one
        one_hot_encoded_result[one_hot_index] = 1
        # add the result
        results.append(one_hot_encoded_result)

    return results

Example:

one_hot_encode([2, 1, 1, 2, 5, 3])

# [[0, 1, 0, 0],
#  [1, 0, 0, 0],
#  [1, 0, 0, 0],
#  [0, 1, 0, 0],
#  [0, 0, 0, 1],
#  [0, 0, 1, 0]]

one_hot_encode([True, False, True])

# [[0, 1], [1, 0], [0, 1]]

one_hot_encode(['a', 'b', 'c', 'a', 'e'])

# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]

Long(er) Answer

I know there are already a lot of answers to this question, but I noticed two things. First, most of the answers use packages like numpy and/or pandas. And this is a good thing. If you are writing production code, you should probably be using robust, fast algorithms like those provided in the numpy/pandas packages. But, for the sake of education, I think someone should provide an answer which has a transparent algorithm and not just an implementation of someone else’s algorithm. Second, I noticed that many of the answers do not provide a robust implementation of one-hot encoding because they do not meet one of the requirements below. Below are some of the requirements (as I see them) for a useful, accurate, and robust one-hot encoding function:

A one-hot encoding function must:

handle list of various types (e.g. integers, strings, floats, etc.) as input
handle an input list with duplicates
return a list of lists corresponding (in the same order as) to the inputs
return a list of lists where each list is as short as possible

I tested many of the answers to this question and most of them fail on one of the requirements above.

Question 60

Try this:

!pip install category_encoders
import category_encoders as ce

categorical_columns = [...the list of names of the columns you want to one-hot-encode ...]
encoder = ce.OneHotEncoder(cols=categorical_columns, use_cat_names=True)
df_train_encoded = encoder.fit_transform(df_train_small)

df_encoded.head()

The resulting dataframe df_train_encoded is the same as the original, but the categorical features are now replaced with their one-hot-encoded versions.

More information on category_encoders here.

Question 61

Here i tried with this approach :

import numpy as np
#converting to one_hot





def one_hot_encoder(value, datal):

    datal[value] = 1

    return datal


def _one_hot_values(labels_data):
    encoded = [0] * len(labels_data)

    for j, i in enumerate(labels_data):
        max_value = [0] * (np.max(labels_data) + 1)

        encoded[j] = one_hot_encoder(i, max_value)

    return np.array(encoded)

Question 62

I have a simple problem, but I cannot find a good solution to it.

I want to take a NumPy 2D array which represents a grayscale image, and convert it to an RGB PIL image while applying some of the matplotlib colormaps.

I can get a reasonable PNG output by using the pyplot.figure.figimage command:

dpi = 100.0
w, h = myarray.shape[1]/dpi, myarray.shape[0]/dpi
fig = plt.figure(figsize=(w,h), dpi=dpi)
fig.figimage(sub, cmap=cm.gist_earth)
plt.savefig('out.png')

Although I could adapt this to get what I want (probably using StringIO do get the PIL image), I wonder if there is not a simpler way to do that, since it seems to be a very natural problem of image visualization. Let’s say, something like this:

colored_PIL_image = magic_function(array, cmap)

Question 63

Quite a busy one-liner, but here it is:

First ensure your NumPy array, myarray, is normalised with the max value at 1.0.
Apply the colormap directly to myarray.
Rescale to the 0-255 range.
Convert to integers, using np.uint8().
Use Image.fromarray().

And you’re done:

from PIL import Image
from matplotlib import cm
im = Image.fromarray(np.uint8(cm.gist_earth(myarray)*255))

with plt.savefig():

with im.save():

Question 64

input = numpy_image
np.unit8 -> converts to integers
convert(‘RGB’) -> converts to RGB

Image.fromarray -> returns an image object

from PIL import Image
import numpy as np

PIL_image = Image.fromarray(np.uint8(numpy_image)).convert('RGB')

PIL_image = Image.fromarray(numpy_image.astype('uint8'), 'RGB')

问题：如何仅列出Python中的顶级目录？

回答 0

回答 1

步行

如何运作

os.walk

How this works

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

回答 12

回答 13

回答 14

回答 15

回答 16

问题：在Python的相对位置打开文件

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

问题：numpy max vs amax vs maximum

回答 0

回答 1

回答 2

一，单阵列订单统计

二。用于两个数组的元素比较

I. For single array order statistics

II. For element-wise comparison of two arrays

回答 3

问题：iPython Notebook清除代码中的单元格输出

回答 0

回答 1

回答 2

问题：如何在Python中进行热编码？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

回答 12

回答 13

回答 14

回答 15

回答 16

回答 17

简短答案

长（长）答案

Short Answer

Long(er) Answer

回答 18

回答 19

问题：如何使用matplotlib颜色图将NumPy数组转换为PIL图像

回答 0

回答 1

回答 2

问题：Pandas DataFrame：根据条件替换列中的所有值

回答 0

回答 1