Tag Archives: dataframe

Python pandas: fill a DataFrame row by row

Question: Python pandas: fill a DataFrame row by row

The simple task of adding a row to a pandas.DataFrame object seems to be hard to accomplish. There are 3 stackoverflow questions relating to this, none of which give a working answer.

Here is what I’m trying to do. I have a DataFrame of which I already know the shape as well as the names of the rows and columns.

>>> df = pandas.DataFrame(columns=['a','b','c','d'], index=['x','y','z'])
>>> df
     a    b    c    d
x  NaN  NaN  NaN  NaN
y  NaN  NaN  NaN  NaN
z  NaN  NaN  NaN  NaN

Now, I have a function to compute the values of the rows iteratively. How can I fill in one of the rows with either a dictionary or a pandas.Series? Here are various attempts that have failed:

>>> y = {'a':1, 'b':5, 'c':2, 'd':3} 
>>> df['y'] = y
AssertionError: Length of values does not match length of index

Apparently it tried to add a column instead of a row.

>>> y = {'a':1, 'b':5, 'c':2, 'd':3} 
>>> df.join(y)
AttributeError: 'builtin_function_or_method' object has no attribute 'is_unique'

Very uninformative error message.

>>> y = {'a':1, 'b':5, 'c':2, 'd':3} 
>>> df.set_value(index='y', value=y)
TypeError: set_value() takes exactly 4 arguments (3 given)

Apparently that is only for setting individual values in the dataframe.

>>> y = {'a':1, 'b':5, 'c':2, 'd':3} 
>>> df.append(y)
Exception: Can only append a Series if ignore_index=True

Well, I don’t want to ignore the index, otherwise here is the result:

>>> df.append(y, ignore_index=True)
     a    b    c    d
0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN
3    1    5    2    3

It did align the column names with the values, but lost the row labels.

>>> y = {'a':1, 'b':5, 'c':2, 'd':3} 
>>> df.ix['y'] = y
>>> df
                                  a                                 b  \
x                               NaN                               NaN
y  {'a': 1, 'c': 2, 'b': 5, 'd': 3}  {'a': 1, 'c': 2, 'b': 5, 'd': 3}
z                               NaN                               NaN

                                  c                                 d
x                               NaN                               NaN
y  {'a': 1, 'c': 2, 'b': 5, 'd': 3}  {'a': 1, 'c': 2, 'b': 5, 'd': 3}
z                               NaN                               NaN

That also failed miserably.

So how do you do it?


Answer 0

df['y'] will set a column

since you want to set a row, use .loc

Note that .ix is equivalent here; yours failed because you tried to assign a dictionary to each element of the row y, which is probably not what you want. Converting to a Series tells pandas that you want to align the input (for example, you then don't have to specify all of the elements).

In [7]: df = pandas.DataFrame(columns=['a','b','c','d'], index=['x','y','z'])

In [8]: df.loc['y'] = pandas.Series({'a':1, 'b':5, 'c':2, 'd':3})

In [9]: df
Out[9]: 
     a    b    c    d
x  NaN  NaN  NaN  NaN
y    1    5    2    3
z  NaN  NaN  NaN  NaN
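
Since the Series is aligned on the column labels, you also don't have to supply every column. As a minimal sketch (assuming the df above), columns omitted from the Series simply stay NaN:

In [10]: df.loc['z'] = pandas.Series({'a': 9, 'b': 10})

In [11]: df
Out[11]: 
     a    b    c    d
x  NaN  NaN  NaN  NaN
y    1    5    2    3
z    9   10  NaN  NaN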

Answer 1

Here is my approach, though I can't guarantee that it's the fastest solution.

df = pd.DataFrame(columns=["firstname", "lastname"])
df = df.append({
     "firstname": "John",
     "lastname":  "Johny"
      }, ignore_index=True)
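
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer versions an equivalent sketch would use pd.concat instead:

df = pd.DataFrame(columns=["firstname", "lastname"])
new_row = pd.DataFrame([{"firstname": "John", "lastname": "Johny"}])
# concat replaces the removed df.append
df = pd.concat([df, new_row], ignore_index=True)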

Answer 2

This is a simpler version

import pandas as pd
df = pd.DataFrame(columns=('col1', 'col2', 'col3'))
for i in range(5):
   df.loc[i] = ['<some value for first>','<some value for second>','<some value for third>']

Answer 3

If your input rows are lists rather than dictionaries, then the following is a simple solution:

import pandas as pd
list_of_lists = []
list_of_lists.append([1,2,3])
list_of_lists.append([4,5,6])

pd.DataFrame(list_of_lists, columns=['A', 'B', 'C'])
#    A  B  C
# 0  1  2  3
# 1  4  5  6
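
If you also want row labels with this approach, they can be supplied up front via index; a small sketch with made-up labels:

pd.DataFrame(list_of_lists, columns=['A', 'B', 'C'], index=['x', 'y'])
#    A  B  C
# x  1  2  3
# y  4  5  6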

Pandas DataFrame: replace all values in a column based on a condition

Question: Pandas DataFrame: replace all values in a column based on a condition

I have a simple DataFrame like the following:

I want to select all values from the ‘First Season’ column and replace those that are over 1990 by 1. In this example, only Baltimore Ravens would have the 1996 replaced by 1 (keeping the rest of the data intact).

I have used the following:

df.loc[(df['First Season'] > 1990)] = 1

But, it replaces all the values in that row by 1, and not just the values in the ‘First Season’ column.

How can I replace just the values from that column?


Answer 0

You need to select that column:

In [41]:
df.loc[df['First Season'] > 1990, 'First Season'] = 1
df

Out[41]:
                 Team  First Season  Total Games
0      Dallas Cowboys          1960          894
1       Chicago Bears          1920         1357
2   Green Bay Packers          1921         1339
3      Miami Dolphins          1966          792
4    Baltimore Ravens             1          326
5  San Franciso 49ers          1950         1003

So the syntax here is:

df.loc[<mask>(here mask is generating the labels to index) , <optional column(s)> ]

You can check the docs, and also the 10 Minutes to pandas guide, which shows the semantics.

EDIT

If you want to generate a boolean indicator, then you can just use the boolean condition to generate a boolean Series and cast the dtype to int; this will convert True and False to 1 and 0, respectively:

In [43]:
df['First Season'] = (df['First Season'] > 1990).astype(int)
df

Out[43]:
                 Team  First Season  Total Games
0      Dallas Cowboys             0          894
1       Chicago Bears             0         1357
2   Green Bay Packers             0         1339
3      Miami Dolphins             0          792
4    Baltimore Ravens             1          326
5  San Franciso 49ers             0         1003
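
Equivalently, Series.mask replaces the values where the condition holds, which avoids repeating the column name inside a .loc call; a small sketch using the same condition:

df['First Season'] = df['First Season'].mask(df['First Season'] > 1990, 1)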

Answer 1

A bit late to the party but still – I prefer using numpy where:

import numpy as np
df['First Season'] = np.where(df['First Season'] > 1990, 1, df['First Season'])

Answer 2

df['First Season'].loc[(df['First Season'] > 1990)] = 1

Strange that nobody has this answer; the only missing part of your code is ['First Season'] right after df (and you can drop the inner parentheses).
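
A caution on this pattern: df['First Season'].loc[...] = 1 is chained indexing, which can trigger SettingWithCopyWarning and, under pandas' copy-on-write mode (planned as the default in pandas 3.0), may not modify df at all. The single .loc call from the accepted answer is the safe form:

df.loc[df['First Season'] > 1990, 'First Season'] = 1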


Answer 3

For a single condition, i.e. (df['employrate'] > 70):

       country        employrate alcconsumption
0  Afghanistan  55.7000007629394            .03
1      Albania  51.4000015258789           7.29
2      Algeria              50.5            .69
3      Andorra                            10.17
4       Angola  75.6999969482422           5.57

use this:

df.loc[df['employrate'] > 70, 'employrate'] = 7

       country  employrate alcconsumption
0  Afghanistan   55.700001            .03
1      Albania   51.400002           7.29
2      Algeria   50.500000            .69
3      Andorra         nan          10.17
4       Angola    7.000000           5.57

The syntax here is therefore:

df.loc[<mask>(here mask is generating the labels to index) , <optional column(s)> ]

For multiple conditions, i.e. (df['employrate'] <= 55) & (df['employrate'] > 50),

use this:

df['employrate'] = np.where(
   (df['employrate'] <=55) & (df['employrate'] > 50) , 11, df['employrate']
   )

out[108]:
       country  employrate alcconsumption
0  Afghanistan   55.700001            .03
1      Albania   11.000000           7.29
2      Algeria   11.000000            .69
3      Andorra         nan          10.17
4       Angola   75.699997           5.57

The syntax here is therefore:

 df['<column_name>'] = np.where((<filter 1> ) & (<filter 2>) , <new value>, df['column_name'])

Answer 4

df.loc[df['First Season'] > 1990, 'First Season'] = 1

Explanation:

df.loc takes two arguments, a row indexer and a column indexer. For each row, we check whether the value in the 'First Season' column is greater than 1990, and if so we replace it with 1.


Remove unwanted parts from strings in a column

Question: Remove unwanted parts from strings in a column

I am looking for an efficient way to remove unwanted parts from strings in a DataFrame column.

Data looks like:

    time    result
1    09:00   +52A
2    10:00   +62B
3    11:00   +44a
4    12:00   +30b
5    13:00   -110a

I need to trim these data to:

    time    result
1    09:00   52
2    10:00   62
3    11:00   44
4    12:00   30
5    13:00   110

I tried .str.lstrip('+-') and .str.rstrip('aAbBcC'), but got an error:

TypeError: wrapper() takes exactly 1 argument (2 given)

Any pointers would be greatly appreciated!


Answer 0

data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))

Answer 1

How do I remove unwanted parts from strings in a column?

6 years after the original question was posted, pandas now has a good number of “vectorised” string functions that can succinctly perform these string manipulation operations.

This answer will explore some of these string functions, suggest faster alternatives, and go into a timings comparison at the end.


.str.replace

Specify the substring/pattern to match, and the substring to replace it with.

pd.__version__
# '0.24.1'

df    
    time result
1  09:00   +52A
2  10:00   +62B
3  11:00   +44a
4  12:00   +30b
5  13:00  -110a

df['result'] = df['result'].str.replace(r'\D', '')
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

If you need the result converted to an integer, you can use Series.astype,

df['result'] = df['result'].str.replace(r'\D', '').astype(int)

df.dtypes
time      object
result     int64
dtype: object

If you don’t want to modify df in-place, use DataFrame.assign:

df2 = df.assign(result=df['result'].str.replace(r'\D', ''))
df
# Unchanged
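
A compatibility note: in pandas 2.0 the default of str.replace changed to regex=False, so on current versions the \D pattern must be flagged explicitly:

df['result'] = df['result'].str.replace(r'\D', '', regex=True)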

.str.extract

Useful for extracting the substring(s) you want to keep.

df['result'] = df['result'].str.extract(r'(\d+)', expand=False)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

With extract, it is necessary to specify at least one capture group. expand=False will return a Series with the captured items from the first capture group.


.str.split and .str.get

Splitting works assuming all your strings follow this consistent structure.

# df['result'] = df['result'].str.split(r'\D').str[1]
df['result'] = df['result'].str.split(r'\D').str.get(1)
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

This is not recommended if you are looking for a general solution.


If you are satisfied with the succinct and readable str accessor-based solutions above, you can stop here. However, if you are interested in faster, more performant alternatives, keep reading.


Optimizing: List Comprehensions

In some circumstances, list comprehensions should be favoured over pandas string functions. The reason is that string functions are inherently hard to vectorize (in the true sense of the word), so most string and regex functions are only wrappers around loops, with more overhead.

My write-up, Are for-loops in pandas really bad? When should I care?, goes into greater detail.

The str.replace option can be re-written using re.sub

import re

# Pre-compile your regex pattern for more performance.
p = re.compile(r'\D')
df['result'] = [p.sub('', x) for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

The str.extract example can be re-written using a list comprehension with re.search,

p = re.compile(r'\d+')
df['result'] = [p.search(x)[0] for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

If NaNs or no-matches are a possibility, you will need to re-write the above to include some error checking. I do this using a function.

def try_extract(pattern, string):
    try:
        m = pattern.search(string)
        return m.group(0)
    except (TypeError, ValueError, AttributeError):
        return np.nan

p = re.compile(r'\d+')
df['result'] = [try_extract(p, x) for x in df['result']]
df

    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110

We can also re-write @eumiro’s and @MonkeyButter’s answers using list comprehensions:

df['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in df['result']]

And,

df['result'] = [x[1:-1] for x in df['result']]

Same rules for handling NaNs, etc, apply.


Performance Comparison

Graphs generated using perfplot. Full code listing, for your reference. The relevant functions are listed below.

Some of these comparisons are unfair because they take advantage of the structure of OP's data, but take from it what you will. One thing to note is that every list comprehension function is either faster than or comparable to its equivalent pandas variant.

Functions

def eumiro(df):
    return df.assign(
        result=df['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC')))

def coder375(df):
    return df.assign(
        result=df['result'].replace(r'\D', r'', regex=True))

def monkeybutter(df):
    return df.assign(result=df['result'].map(lambda x: x[1:-1]))

def wes(df):
    return df.assign(result=df['result'].str.lstrip('+-').str.rstrip('aAbBcC'))

def cs1(df):
    return df.assign(result=df['result'].str.replace(r'\D', ''))

def cs2_ted(df):
    # `str.extract`-based solution, similar to @Ted Petrou's, so timed together.
    return df.assign(result=df['result'].str.extract(r'(\d+)', expand=False))
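
# Note: p1 and p2 below are the patterns precompiled earlier in this answer:
p1 = re.compile(r'\D')
p2 = re.compile(r'\d+')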

def cs1_listcomp(df):
    return df.assign(result=[p1.sub('', x) for x in df['result']])

def cs2_listcomp(df):
    return df.assign(result=[p2.search(x)[0] for x in df['result']])

def cs_eumiro_listcomp(df):
    return df.assign(
        result=[x.lstrip('+-').rstrip('aAbBcC') for x in df['result']])

def cs_mb_listcomp(df):
    return df.assign(result=[x[1:-1] for x in df['result']])

Answer 2

I'd use the pandas replace function; it's very simple and powerful, as you can use regex. Below I'm using the regex \D to remove any non-digit characters, but obviously you could get quite creative with regex.

data['result'].replace(regex=True,inplace=True,to_replace=r'\D',value=r'')

Answer 3

In the particular case where you know the number of positions that you want to remove from the dataframe column, you can use string indexing inside a lambda function to get rid of those parts:

Last character:

data['result'] = data['result'].map(lambda x: str(x)[:-1])

First two characters:

data['result'] = data['result'].map(lambda x: str(x)[2:])

Answer 4

There’s a bug here: currently cannot pass arguments to str.lstrip and str.rstrip:

http://github.com/pydata/pandas/issues/2411

EDIT: 2012-12-07 this works now on the dev branch:

In [8]: df['result'].str.lstrip('+-').str.rstrip('aAbBcC')
Out[8]: 
1     52
2     62
3     44
4     30
5    110
Name: result
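
On current pandas versions the arguments are accepted; and since none of the characters to strip are digits, a single str.strip can even cover both ends in one call (a small sketch):

df['result'].str.strip('+-aAbBcC')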

Answer 5

A very simple method would be to use the extract method to select all the digits. Simply supply it the regular expression '\d+' which extracts any number of digits.

df['result'] = df.result.str.extract(r'(\d+)', expand=True).astype(int)
df

    time  result
1  09:00      52
2  10:00      62
3  11:00      44
4  12:00      30
5  13:00     110

Answer 6

I often use list comprehensions for these types of tasks because they’re often faster.

There can be big differences in performance between the various methods for doing things like this (i.e. modifying every element of a series within a DataFrame). Often a list comprehension can be fastest – see code race below for this task:

import pandas as pd
#Map
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
10000 loops, best of 3: 187 µs per loop
#List comprehension
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in data['result']]
10000 loops, best of 3: 117 µs per loop
#.str
data = pd.DataFrame({'time':['09:00','10:00','11:00','12:00','13:00'], 'result':['+52A','+62B','+44a','+30b','-110a']})
%timeit data['result'] = data['result'].str.lstrip('+-').str.rstrip('aAbBcC')
1000 loops, best of 3: 336 µs per loop

Answer 7

Suppose your DF also has those extra characters in between the numbers, as in the last entry:

  result   time
0   +52A  09:00
1   +62B  10:00
2   +44a  11:00
3   +30b  12:00
4  -110a  13:00
5   3+b0  14:00

You can try str.replace to remove characters not only from the start and end, but also from in between:

DF['result'] = DF['result'].str.replace('\+|a|b|\-|A|B', '')

Output:

  result   time
0     52  09:00
1     62  10:00
2     44  11:00
3     30  12:00
4    110  13:00
5     30  14:00

Answer 8

Try this using a regular expression:

import re
data['result'] = data['result'].map(lambda x: re.sub('[-+A-Za-z]', '', x))

Concatenate a list of pandas DataFrames together

Question: Concatenate a list of pandas DataFrames together

I have a list of Pandas dataframes that I would like to combine into one Pandas dataframe. I am using Python 2.7.10 and Pandas 0.16.2

I created the list of dataframes from:

import pandas as pd
dfs = []
sqlall = "select * from mytable"

for chunk in pd.read_sql_query(sqlall , cnxn, chunksize=10000):
    dfs.append(chunk)

This returns a list of dataframes

type(dfs[0])
Out[6]: pandas.core.frame.DataFrame

type(dfs)
Out[7]: list

len(dfs)
Out[8]: 408

Here is some sample data

# sample dataframes
d1 = pd.DataFrame({'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]})
d2 = pd.DataFrame({'one' : [5., 6., 7., 8.], 'two' : [9., 10., 11., 12.]})
d3 = pd.DataFrame({'one' : [15., 16., 17., 18.], 'two' : [19., 10., 11., 12.]})

# list of dataframes
mydfs = [d1, d2, d3]

I would like to combine d1, d2, and d3 into one pandas dataframe. Alternatively, a method of reading a large-ish table directly into a dataframe when using the chunksize option would be very helpful.


Answer 0

Given that all the dataframes have the same columns, you can simply concat them:

import pandas as pd
df = pd.concat(list_of_dataframes)
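
Since each chunk from read_sql_query typically carries its own RangeIndex starting at 0, passing ignore_index=True gives the combined frame one clean index (a minor variation on the above):

df = pd.concat(dfs, ignore_index=True)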

Answer 1

If the dataframes DO NOT all have the same columns try the following:

df = pd.DataFrame.from_dict(map(dict,df_list))

Answer 2

You also can do it with functional programming:

from functools import reduce
reduce(lambda df1, df2: df1.merge(df2, "outer"), mydfs)

Answer 3

concat also works nicely with a list comprehension that pulls rows from an existing dataframe using loc:

df = pd.read_csv('./data.csv') # ie; Dataframe pulled from csv file with a "userID" column

review_ids = ['1','2','3'] # ie; ID values to grab from DataFrame

# Gets rows in df where IDs match in the userID column and combines them 

dfa = pd.concat([df.loc[df['userID'] == x] for x in review_ids])
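
For this particular pattern, isin selects the same rows in one shot without the per-ID concat; note that it keeps the original row order rather than the order of review_ids (a sketch under the same assumptions):

dfa = df[df['userID'].isin(review_ids)]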

How do I add a new column to a Spark DataFrame (using PySpark)?

Question: How do I add a new column to a Spark DataFrame (using PySpark)?

I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column.

I’ve tried the following without any success:

type(randomed_hours) # => list

# Create in Python and transform to RDD

new_col = pd.DataFrame(randomed_hours, columns=['new_col'])

spark_new_col = sqlContext.createDataFrame(new_col)

my_df_spark.withColumn("hours", spark_new_col["new_col"])

Also got an error using this:

my_df_spark.withColumn("hours",  sc.parallelize(randomed_hours))

So how do I add a new column (based on Python vector) to an existing DataFrame with PySpark?


Answer 0

You cannot add an arbitrary column to a DataFrame in Spark. New columns can be created only by using literals (other literal types are described in How to add a constant column in a Spark DataFrame?)

from pyspark.sql.functions import lit

df = sqlContext.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))

df_with_x4 = df.withColumn("x4", lit(0))
df_with_x4.show()

## +---+---+-----+---+
## | x1| x2|   x3| x4|
## +---+---+-----+---+
## |  1|  a| 23.0|  0|
## |  3|  B|-23.0|  0|
## +---+---+-----+---+

transforming an existing column:

from pyspark.sql.functions import exp

df_with_x5 = df_with_x4.withColumn("x5", exp("x3"))
df_with_x5.show()

## +---+---+-----+---+--------------------+
## | x1| x2|   x3| x4|                  x5|
## +---+---+-----+---+--------------------+
## |  1|  a| 23.0|  0| 9.744803446248903E9|
## |  3|  B|-23.0|  0|1.026187963170189...|
## +---+---+-----+---+--------------------+

included using join:

from pyspark.sql.functions import col, exp

lookup = sqlContext.createDataFrame([(1, "foo"), (2, "bar")], ("k", "v"))
df_with_x6 = (df_with_x5
    .join(lookup, col("x1") == col("k"), "leftouter")
    .drop("k")
    .withColumnRenamed("v", "x6"))

## +---+---+-----+---+--------------------+----+
## | x1| x2|   x3| x4|                  x5|  x6|
## +---+---+-----+---+--------------------+----+
## |  1|  a| 23.0|  0| 9.744803446248903E9| foo|
## |  3|  B|-23.0|  0|1.026187963170189...|null|
## +---+---+-----+---+--------------------+----+

or generated with function / udf:

from pyspark.sql.functions import rand

df_with_x7 = df_with_x6.withColumn("x7", rand())
df_with_x7.show()

## +---+---+-----+---+--------------------+----+-------------------+
## | x1| x2|   x3| x4|                  x5|  x6|                 x7|
## +---+---+-----+---+--------------------+----+-------------------+
## |  1|  a| 23.0|  0| 9.744803446248903E9| foo|0.41930610446846617|
## |  3|  B|-23.0|  0|1.026187963170189...|null|0.37801881545497873|
## +---+---+-----+---+--------------------+----+-------------------+

Performance-wise, built-in functions (pyspark.sql.functions), which map to Catalyst expressions, are usually preferred over Python user-defined functions.

If you want to add the content of an arbitrary RDD as a column, you can (a sketch follows this list):

  • add row numbers to the existing DataFrame
  • call zipWithIndex on the RDD and convert it to a DataFrame
  • join both using the index as a join key
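
A rough sketch of those three steps, assuming df and the plain Python list randomed_hours from the question (the helper column name idx is made up here):

from pyspark.sql import Row

# 1. index the existing rows by position
df_indexed = df.rdd.zipWithIndex().map(
    lambda row_i: Row(idx=row_i[1], **row_i[0].asDict())).toDF()

# 2. index the new values the same way
hours_indexed = sc.parallelize(randomed_hours).zipWithIndex().map(
    lambda v_i: Row(idx=v_i[1], hours=v_i[0])).toDF()

# 3. join on the positional index and drop the helper column
result = df_indexed.join(hours_indexed, "idx").drop("idx")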


Answer 1

To add a column using a UDF:

df = sqlContext.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))

from pyspark.sql.functions import udf
from pyspark.sql.types import *

def valueToCategory(value):
   if   value == 1: return 'cat1'
   elif value == 2: return 'cat2'
   ...
   else: return 'n/a'

# NOTE: it seems that calls to udf() must be after SparkContext() is called
udfValueToCategory = udf(valueToCategory, StringType())
df_with_cat = df.withColumn("category", udfValueToCategory("x1"))
df_with_cat.show()

## +---+---+-----+---------+
## | x1| x2|   x3| category|
## +---+---+-----+---------+
## |  1|  a| 23.0|     cat1|
## |  3|  B|-23.0|      n/a|
## +---+---+-----+---------+

Answer 2

For Spark 2.0

# assumes schema has 'age' column 
df.select('*', (df.age + 10).alias('agePlusTen'))

Answer 3

There are multiple ways we can add a new column in pySpark.

Let’s first create a simple DataFrame.

date = [27, 28, 29, None, 30, 31]
df = spark.createDataFrame(date, IntegerType())

Now let's try to double the column value and store it in a new column. Below are a few different approaches that achieve the same result.

# Approach - 1 : using withColumn function
df.withColumn("double", df.value * 2).show()

# Approach - 2 : using select with alias function.
df.select("*", (df.value * 2).alias("double")).show()

# Approach - 3 : using selectExpr function with as clause.
df.selectExpr("*", "value * 2 as double").show()

# Approach - 4 : Using as clause in SQL statement.
df.createTempView("temp")
spark.sql("select *, value * 2 as double from temp").show()

For more examples and explanation on spark DataFrame functions, you can visit my blog.

I hope this helps.


Answer 4

You can define a new udf when adding a column_name:

u_f = F.udf(lambda: yourstring, StringType())
a.select(u_f().alias('column_name'))
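
If the value really is a constant string, a UDF isn't needed at all; lit, as in the accepted answer, is simpler and cheaper (same hypothetical names as above):

from pyspark.sql.functions import lit

a.withColumn('column_name', lit(yourstring))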

Answer 5

from pyspark.sql.functions import udf
from pyspark.sql.types import *
func_name = udf(
    lambda val: val, # do sth to val
    StringType()
)
df.withColumn('new_col', func_name(df.old_col))

Answer 6

I would like to offer a generalized example for a very similar use case:

Use Case: I have a csv consisting of:

First|Third|Fifth
data|data|data
data|data|data
...billion more lines

I need to perform some transformations and the final csv needs to look like

First|Second|Third|Fourth|Fifth
data|null|data|null|data
data|null|data|null|data
...billion more lines

I need to do this because this is the schema defined by some model and I need for my final data to be interoperable with SQL Bulk Inserts and such things.

so:

1) I read the original csv using spark.read and call it “df”.

2) I do something to the data.

3) I add the null columns using this script:

outcols = []
for column in MY_COLUMN_LIST:
    if column in df.columns:
        outcols.append(column)
    else:
        outcols.append(lit(None).cast(StringType()).alias('{0}'.format(column)))

df = df.select(outcols)

In this way, you can structure your schema after loading a csv (would also work for reordering columns if you have to do this for many tables).


Answer 7

The simplest way to add a column is to use withColumn. Since the dataframe is created using sqlContext, you have to specify the schema, or it can be inferred from the dataset by default. If the schema is specified explicitly, the workload becomes tedious when it changes every time.

Below is an example that you can consider:

from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc) # SparkContext will be sc by default 

# Read the dataset of your choice (Already loaded with schema)
Data = sqlContext.read.csv("/path", header = True/False, schema = "infer", sep = "delimiter")

# For instance the data has 30 columns from col1, col2, ... col30. If you want to add a 31st column, you can do so by the following:
Data = Data.withColumn("col31", "Code goes here")

# Check the change 
Data.printSchema()

Answer 8

We can add additional columns to DataFrame directly with below steps:

from pyspark.sql.functions import when
df = spark.createDataFrame([["amit", 30], ["rohit", 45], ["sameer", 50]], ["name", "age"])
df = df.withColumn("profile", when(df.age >= 40, "Senior").otherwise("Executive"))
df.show()

Multiple aggregations of the same column using pandas GroupBy.agg()

Question: Multiple aggregations of the same column using pandas GroupBy.agg()

Is there a pandas built-in way to apply two different aggregating functions f1, f2 to the same column df["returns"], without having to call agg() multiple times?

Example dataframe:

import pandas as pd
import numpy as np
import datetime as dt

np.random.seed(0)
df = pd.DataFrame({
         "date"    :  [dt.date(2012, x, 1) for x in range(1, 11)], 
         "returns" :  0.05 * np.random.randn(10), 
         "dummy"   :  np.repeat(1, 10)
}) 

The syntactically wrong, but intuitively right, way to do it would be:

# Assume `f1` and `f2` are defined for aggregating.
df.groupby("dummy").agg({"returns": f1, "returns": f2})

Obviously, Python doesn’t allow duplicate keys. Is there any other manner for expressing the input to agg()? Perhaps a list of tuples [(column, function)] would work better, to allow multiple functions applied to the same column? But agg() seems like it only accepts a dictionary.

Is there a workaround for this besides defining an auxiliary function that just applies both of the functions inside of it? (How would this work with aggregation anyway?)


Answer 0

You can simply pass the functions as a list:

In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:         
           mean       sum
dummy                    
1      0.036901  0.369012

or as a dictionary:

In [21]: df.groupby('dummy').agg({'returns':
                                  {'Mean': np.mean, 'Sum': np.sum}})
Out[21]: 
        returns          
           Mean       Sum
dummy                    
1      0.036901  0.369012

Answer 1

TLDR; Pandas groupby.agg has a new, easier syntax for specifying (1) aggregations on multiple columns, and (2) multiple aggregations on a column. So, to do this for pandas >= 0.25, use

df.groupby('dummy').agg(Mean=('returns', 'mean'), Sum=('returns', 'sum'))

           Mean       Sum
dummy                    
1      0.036901  0.369012

OR

df.groupby('dummy')['returns'].agg(Mean='mean', Sum='sum')

           Mean       Sum
dummy                    
1      0.036901  0.369012

Pandas >= 0.25: Named Aggregation

Pandas has changed the behavior of GroupBy.agg in favour of a more intuitive syntax for specifying named aggregations. See the 0.25 docs section on Enhancements as well as relevant GitHub issues GH18366 and GH26512.

From the documentation,

To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where

  • The keywords are the output column names
  • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields [‘column’, ‘aggfunc’] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

You can now pass a tuple via keyword arguments. The tuples follow the format of (<colName>, <aggFunc>).

import pandas as pd

pd.__version__                                                                                                                            
# '0.25.0.dev0+840.g989f912ee'

# Setup
df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                   'height': [9.1, 6.0, 9.5, 34.0],
                   'weight': [7.9, 7.5, 9.9, 198.0]
})

df.groupby('kind').agg(
    max_height=('height', 'max'), min_weight=('weight', 'min'),)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

Alternatively, you can use pd.NamedAgg (essentially a namedtuple) which makes things more explicit.

df.groupby('kind').agg(
    max_height=pd.NamedAgg(column='height', aggfunc='max'), 
    min_weight=pd.NamedAgg(column='weight', aggfunc='min')
)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

It is even simpler for Series, just pass the aggfunc to a keyword argument.

df.groupby('kind')['height'].agg(max_height='max', min_height='min')    

      max_height  min_height
kind                        
cat          9.5         9.1
dog         34.0         6.0       

Lastly, if your column names aren’t valid python identifiers, use a dictionary with unpacking:

df.groupby('kind')['height'].agg(**{'max height': 'max', ...})

Pandas < 0.25

In more recent versions of pandas leading up to 0.24, if you use a dictionary for specifying column names for the aggregation output, you will get a FutureWarning:

df.groupby('dummy').agg({'returns': {'Mean': 'mean', 'Sum': 'sum'}})
# FutureWarning: using a dict with renaming is deprecated and will be removed 
# in a future version

Using a dictionary for renaming columns is deprecated in v0.20. On more recent versions of pandas, this can be specified more simply by passing a list of tuples. If specifying the functions this way, all functions for that column need to be specified as tuples of (name, function) pairs.

df.groupby("dummy").agg({'returns': [('op1', 'sum'), ('op2', 'mean')]})

        returns          
            op1       op2
dummy                    
1      0.328953  0.032895

Or,

df.groupby("dummy")['returns'].agg([('op1', 'sum'), ('op2', 'mean')])

            op1       op2
dummy                    
1      0.328953  0.032895

Answer 2

Would something like this work?

In [7]: df.groupby('dummy').returns.agg({'func1' : lambda x: x.sum(), 'func2' : lambda x: x.prod()})
Out[7]: 
              func2     func1
dummy                        
1     -4.263768e-16 -0.188565

Add missing dates to a pandas DataFrame

Question: Add missing dates to a pandas DataFrame

My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don’t always match.

idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size()

In the above code, idx becomes a range of, say, 30 dates, 09-01-2013 to 09-30-2013. However, S may only have 25 or 26 days, because no events happened on some dates. I then get an AssertionError because the sizes don't match when I try to plot:

fig, ax = plt.subplots()    
ax.bar(idx.to_pydatetime(), s, color='green')

What's the proper way to tackle this? Should I remove the dates with no values from idx, or (what I'd rather do) add the missing dates to the series with a count of 0? I'd rather have a full graph of 30 days with 0 values. If that approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?

Here's a snippet of S (df.groupby(['simpleDate']).size()); notice there are no entries for 09-04 and 09-05:

09-02-2013     2
09-03-2013    10
09-06-2013     5
09-07-2013     1

Answer 0

You could use Series.reindex:

import pandas as pd

idx = pd.date_range('09-01-2013', '09-30-2013')

s = pd.Series({'09-02-2013': 2,
               '09-03-2013': 10,
               '09-06-2013': 5,
               '09-07-2013': 1})
s.index = pd.DatetimeIndex(s.index)

s = s.reindex(idx, fill_value=0)
print(s)

yields

2013-09-01     0
2013-09-02     2
2013-09-03    10
2013-09-04     0
2013-09-05     0
2013-09-06     5
2013-09-07     1
2013-09-08     0
...
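
Applied to the question's own code, the reindex chains straight onto the groupby, so s and idx always line up for plotting (a sketch, assuming simpleDate is datetime-like):

idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size().reindex(idx, fill_value=0)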

Answer 1

A quicker workaround is to use .asfreq(). This doesn't require creating a new index to pass to .reindex():

# "broken" (staggered) dates
dates = pd.Index([pd.Timestamp('2012-05-01'), 
                  pd.Timestamp('2012-05-04'), 
                  pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)

print(s.asfreq('D'))
2012-05-01    1.0
2012-05-02    NaN
2012-05-03    NaN
2012-05-04    2.0
2012-05-05    NaN
2012-05-06    3.0
Freq: D, dtype: float64
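
If you want zeroes instead of NaN, newer pandas versions (0.20+) also let .asfreq() fill directly, e.g.:

print(s.asfreq('D', fill_value=0))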

Answer 2

One issue is that reindex will fail if there are duplicate values. Say we’re working with timestamped data, which we want to index by date:

df = pd.DataFrame({
    'timestamps': pd.to_datetime(
        ['2016-11-15 1:00','2016-11-16 2:00','2016-11-16 3:00','2016-11-18 4:00']),
    'values':['a','b','c','d']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
df

yields

            timestamps             values
2016-11-15  "2016-11-15 01:00:00"  a
2016-11-16  "2016-11-16 02:00:00"  b
2016-11-16  "2016-11-16 03:00:00"  c
2016-11-18  "2016-11-18 04:00:00"  d

Due to the duplicate 2016-11-16 date, an attempt to reindex:

all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
df.reindex(all_days)

fails with:

...
ValueError: cannot reindex from a duplicate axis

(by this it means that the index contains duplicate values, not that the index object itself is a duplicate)

Instead, we can use .loc to look up entries for all dates in range:

df.loc[all_days]

yields

            timestamps             values
2016-11-15  "2016-11-15 01:00:00"  a
2016-11-16  "2016-11-16 02:00:00"  b
2016-11-16  "2016-11-16 03:00:00"  c
2016-11-17  NaN                    NaN
2016-11-18  "2016-11-18 04:00:00"  d

fillna can be used on the column series to fill blanks if needed.
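
For example, a minimal sketch filling the gap row created above (the zero fill value is an assumption matching the OP's request):

result = df.loc[all_days].copy()
result['values'] = result['values'].fillna(0)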


Answer 3

An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:

df.resample('D').mean()

resample is a deferred operation like groupby so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.

Here is the original data, but with an extra entry for ‘2013-09-03’:

             val
date           
2013-09-02     2
2013-09-03    10
2013-09-03    20    <- duplicate date added to OP's data
2013-09-06     5
2013-09-07     1

And here are the results:

             val
date            
2013-09-02   2.0
2013-09-03  15.0    <- mean of original values for 2013-09-03
2013-09-04   NaN    <- NaN b/c date not present in orig
2013-09-05   NaN    <- NaN b/c date not present in orig
2013-09-06   5.0
2013-09-07   1.0

I left the missing dates as NaN to make it clear how this works, but you can append fillna(0) to replace the NaNs with zeroes, as the OP requested, or use something like interpolate() to fill in non-zero values based on the neighboring rows.
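
For instance, to get the OP's desired zero-filled output with this approach (a sketch):

df.resample('D').mean().fillna(0)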


Answer 4

Here's a handy method to fill in missing dates in a dataframe, with your choice of fill_value, how many days_back to fill in, and the sort order (date_order) for the resulting dataframe:

import pandas as pd
from datetime import datetime, timedelta

def fill_in_missing_dates(df, date_col_name='date', date_order='asc', fill_value=0, days_back=30):

    df.set_index(date_col_name, drop=True, inplace=True)
    df.index = pd.DatetimeIndex(df.index)
    d = datetime.now().date()
    d2 = d - timedelta(days=days_back)
    idx = pd.date_range(d2, d, freq="D")
    df = df.reindex(idx, fill_value=fill_value)
    df[date_col_name] = pd.DatetimeIndex(df.index)
    # sort by the reconstructed date index in the requested order
    df = df.sort_index(ascending=(date_order == 'asc'))

    return df
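
A hypothetical call, assuming a dataframe with a 'date' column covering the last 30 days:

df = fill_in_missing_dates(df, date_col_name='date', fill_value=0, days_back=30)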

datetime dtypes in pandas read_csv

Question: datetime dtypes in pandas read_csv

I’m reading in a csv file with multiple datetime columns. I’d need to set the data types upon reading in the file, but datetimes appear to be a problem. For instance:

headers = ['col1', 'col2', 'col3', 'col4']
dtypes = ['datetime', 'datetime', 'str', 'float']
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

When run, this gives an error:

TypeError: data type "datetime" not understood

Converting columns after the fact, via pandas.to_datetime(), isn't an option: I can't know which columns will be datetime objects. That information can change and comes from whatever informs my dtypes list.

Alternatively, I've tried to load the csv file with numpy.genfromtxt, set the dtypes in that function, and then convert to a pandas.dataframe, but it garbles the data. Any help is greatly appreciated!


Answer 0

Why it does not work

There is no datetime dtype to be set for read_csv as csv files can only contain strings, integers and floats.

Setting a dtype to datetime will make pandas interpret the datetime as an object, meaning you will end up with a string.

Pandas way of solving this

The pandas.read_csv() function has a keyword argument called parse_dates

Using this, you can convert strings, floats or integers into datetimes on the fly, using the default date_parser (dateutil.parser.parser):

headers = ['col1', 'col2', 'col3', 'col4']
dtypes = {'col1': 'str', 'col2': 'str', 'col3': 'str', 'col4': 'float'}
parse_dates = ['col1', 'col2']
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes, parse_dates=parse_dates)

This will cause pandas to read col1 and col2 as strings, which they most likely are ("2016-05-05" etc.); after the string is read, the date_parser for each column will act upon it and give back whatever that function returns.

Defining your own date parsing function:

The pandas.read_csv() function also has a keyword argument called date_parser

Setting this to a lambda function will make that particular function be used for the parsing of the dates.
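
For example, a minimal sketch (the format string is an assumption about how dates appear in the file):

date_parser = lambda s: pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S')
pd.read_csv(file, sep='\t', header=None, names=headers,
            parse_dates=['col1', 'col2'], date_parser=date_parser)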

GOTCHA WARNING

You have to give it the function itself, not a call to the function. This is correct:

date_parser = pd.datetools.to_datetime

This is incorrect:

date_parser = pd.datetools.to_datetime()

Pandas 0.22 Update

pd.datetools.to_datetime has been relocated to pd.to_datetime, so use date_parser = pd.to_datetime.

Thanks @stackoverYC


Answer 1

There is a parse_dates parameter for read_csv which allows you to define the names of the columns you want treated as dates or datetimes:

date_cols = ['col1', 'col2']
pd.read_csv(file, sep='\t', header=None, names=headers, parse_dates=date_cols)
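
As a quick check (a sketch, assuming the same file and headers as in the question), the parsed columns should come back as datetime64[ns]:

df = pd.read_csv(file, sep='\t', header=None, names=headers, parse_dates=date_cols)
print(df.dtypes)  # col1 and col2 should now show datetime64[ns]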

Answer 2

You might try passing actual types instead of strings.

import pandas as pd
from datetime import datetime
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime, datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

But it’s going to be really hard to diagnose this without any of your data to tinker with.

And really, you probably want pandas to parse the dates into Timestamps, so that might be:

pd.read_csv(file, sep='\t', header=None, names=headers, parse_dates=True)

Answer 3

I tried using the dtypes=[datetime, …] option, but

import pandas as pd
from datetime import datetime
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime, datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

I encountered the following error:

TypeError: data type not understood

The only change I had to make was to replace datetime with datetime.datetime (importing the datetime module itself rather than the class):

import pandas as pd
import datetime
headers = ['col1', 'col2', 'col3', 'col4']
dtypes = [datetime.datetime, datetime.datetime, str, float]
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

How to form a tuple column from two columns in pandas

Question: How to form a tuple column from two columns in pandas

I’ve got a Pandas DataFrame and I want to combine the ‘lat’ and ‘long’ columns to form a tuple.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 205482 entries, 0 to 209018
Data columns:
Month           205482  non-null values
Reported by     205482  non-null values
Falls within    205482  non-null values
Easting         205482  non-null values
Northing        205482  non-null values
Location        205482  non-null values
Crime type      205482  non-null values
long            205482  non-null values
lat             205482  non-null values
dtypes: float64(4), object(5)

The code I tried to use was:

def merge_two_cols(series): 
    return (series['lat'], series['long'])

sample['lat_long'] = sample.apply(merge_two_cols, axis=1)

However, this returned the following error:

---------------------------------------------------------------------------
 AssertionError                            Traceback (most recent call last)
<ipython-input-261-e752e52a96e6> in <module>()
      2     return (series['lat'], series['long'])
      3 
----> 4 sample['lat_long'] = sample.apply(merge_two_cols, axis=1)
      5

AssertionError: Block shape incompatible with manager 

How can I solve this problem?


Answer 0

Get comfortable with zip. It comes in handy when dealing with column data.

df['new_col'] = list(zip(df.lat, df.long))

It’s less complicated and faster than using apply or map. Something like np.dstack is twice as fast as zip, but wouldn’t give you tuples.


Answer 1

In [10]: df
Out[10]:
          A         B       lat      long
0  1.428987  0.614405  0.484370 -0.628298
1 -0.485747  0.275096  0.497116  1.047605
2  0.822527  0.340689  2.120676 -2.436831
3  0.384719 -0.042070  1.426703 -0.634355
4 -0.937442  2.520756 -1.662615 -1.377490
5 -0.154816  0.617671 -0.090484 -0.191906
6 -0.705177 -1.086138 -0.629708  1.332853
7  0.637496 -0.643773 -0.492668 -0.777344
8  1.109497 -0.610165  0.260325  2.533383
9 -1.224584  0.117668  1.304369 -0.152561

In [11]: df['lat_long'] = df[['lat', 'long']].apply(tuple, axis=1)

In [12]: df
Out[12]:
          A         B       lat      long                             lat_long
0  1.428987  0.614405  0.484370 -0.628298      (0.484370195967, -0.6282975278)
1 -0.485747  0.275096  0.497116  1.047605      (0.497115615839, 1.04760475074)
2  0.822527  0.340689  2.120676 -2.436831      (2.12067574274, -2.43683074367)
3  0.384719 -0.042070  1.426703 -0.634355      (1.42670326172, -0.63435462504)
4 -0.937442  2.520756 -1.662615 -1.377490     (-1.66261469102, -1.37749004179)
5 -0.154816  0.617671 -0.090484 -0.191906  (-0.0904840623396, -0.191905582481)
6 -0.705177 -1.086138 -0.629708  1.332853     (-0.629707821728, 1.33285348929)
7  0.637496 -0.643773 -0.492668 -0.777344   (-0.492667604075, -0.777344111021)
8  1.109497 -0.610165  0.260325  2.533383        (0.26032456699, 2.5333825651)
9 -1.224584  0.117668  1.304369 -0.152561     (1.30436900612, -0.152560909725)

Answer 2

Pandas has the itertuples method to do exactly this:

list(df[['lat', 'long']].itertuples(index=False, name=None))
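
To attach the result as a column, as in the other answers (a sketch):

df['lat_long'] = list(df[['lat', 'long']].itertuples(index=False, name=None))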

Answer 3

I'd like to add df.values.tolist() (as long as you don't mind getting a column of lists rather than tuples):

import pandas as pd
import numpy as np

size = int(1e+07)
df = pd.DataFrame({'a': np.random.rand(size), 'b': np.random.rand(size)}) 

%timeit df.values.tolist()
1.47 s ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit list(zip(df.a,df.b))
1.92 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
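
If you do want tuples from this route, mapping tuple over the rows is a small extra step (a sketch):

df['ab'] = list(map(tuple, df[['a', 'b']].values))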

Save Dataframe to csv directly to s3 Python

Question: Save Dataframe to csv directly to s3 Python

I have a pandas DataFrame that I want to upload to a new CSV file. The problem is that I don’t want to save the file locally before transferring it to s3. Is there any method like to_csv for writing the dataframe to s3 directly? I am using boto3.
Here is what I have so far:

import boto3
s3 = boto3.client('s3', aws_access_key_id='key', aws_secret_access_key='secret_key')
read_file = s3.get_object(Bucket, Key)
df = pd.read_csv(read_file['Body'])

# Make alterations to DataFrame

# Then export DataFrame to CSV through direct transfer to s3

Answer 0

You can use:

from io import StringIO # python3; python2: BytesIO 
import boto3

bucket = 'my_bucket_name' # already created on S3
csv_buffer = StringIO()
df.to_csv(csv_buffer)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, 'df.csv').put(Body=csv_buffer.getvalue())

Answer 1

You can directly use the S3 path. I am using Pandas 0.24.1

In [1]: import pandas as pd

In [2]: df = pd.DataFrame( [ [1, 1, 1], [2, 2, 2] ], columns=['a', 'b', 'c'])

In [3]: df
Out[3]:
   a  b  c
0  1  1  1
1  2  2  2

In [4]: df.to_csv('s3://experimental/playground/temp_csv/dummy.csv', index=False)

In [5]: pd.__version__
Out[5]: '0.24.1'

In [6]: new_df = pd.read_csv('s3://experimental/playground/temp_csv/dummy.csv')

In [7]: new_df
Out[7]:
   a  b  c
0  1  1  1
1  2  2  2

Release Note:

S3 File Handling

pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas. GH11915.
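
As an aside, newer pandas versions (1.2+) also let you pass credentials inline via storage_options instead of relying on environment configuration; a sketch with placeholder credentials:

df.to_csv('s3://experimental/playground/temp_csv/dummy.csv', index=False,
          storage_options={'key': 'ACCESS_KEY', 'secret': 'SECRET_KEY'})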


Answer 2

I like s3fs which lets you use s3 (almost) like a local filesystem.

You can do this:

import s3fs

bytes_to_write = df.to_csv(None).encode()
fs = s3fs.S3FileSystem(key=key, secret=secret)
with fs.open('s3://bucket/path/to/file.csv', 'wb') as f:
    f.write(bytes_to_write)

s3fs supports only rb and wb modes for opening files, which is why I encode the CSV to bytes (bytes_to_write) before writing.


Answer 3

This is a more up to date answer:

import s3fs

s3 = s3fs.S3FileSystem(anon=False)

# Use 'w' for py3, 'wb' for py2
with s3.open('<bucket-name>/<filename>.csv','w') as f:
    df.to_csv(f)

The problem with StringIO is that it will eat away at your memory. With this method, you are streaming the file to s3 rather than converting it to a string and then writing that into s3. Holding the pandas dataframe and a full string copy of it in memory seems very inefficient.

If you are working on an EC2 instance, you can give it an IAM role that enables writing to s3, so you don't need to pass in credentials directly. However, you can also connect to a bucket by passing credentials to the S3FileSystem() function. See the documentation: https://s3fs.readthedocs.io/en/latest/


Answer 4

If you pass None as the first argument to to_csv(), the data will be returned as a string. From there it's an easy step to upload that to S3 in one go.

It should also be possible to pass a StringIO object to to_csv(), but using a string will be easier.
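
Putting the two steps together (a sketch; the bucket and key names are placeholders):

import boto3

csv_str = df.to_csv(None)  # returns the CSV content as a str when no path is given
boto3.client('s3').put_object(Bucket='my-bucket', Key='df.csv', Body=csv_str)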


Answer 5

You can also use the AWS Data Wrangler:

import awswrangler as wr
    
wr.s3.to_csv(
    df=df,
    path="s3://...",
)

Note that it will handle multipart upload for you to make the upload faster.


Answer 6

I found this can also be done using the client, and not just the resource:

from io import StringIO
import boto3
s3 = boto3.client("s3",\
                  region_name=region_name,\
                  aws_access_key_id=aws_access_key_id,\
                  aws_secret_access_key=aws_secret_access_key)
csv_buf = StringIO()
df.to_csv(csv_buf, header=True, index=False)
csv_buf.seek(0)
s3.put_object(Bucket=bucket, Body=csv_buf.getvalue(), Key='path/test.csv')

Answer 7

Since you are using boto3.client(), try:

import boto3
from io import StringIO #python3 
s3 = boto3.client('s3', aws_access_key_id='key', aws_secret_access_key='secret_key')
def copy_to_s3(client, df, bucket, filepath):
    csv_buf = StringIO()
    df.to_csv(csv_buf, header=True, index=False)
    csv_buf.seek(0)
    client.put_object(Bucket=bucket, Body=csv_buf.getvalue(), Key=filepath)
    print(f'Copy {df.shape[0]} rows to S3 Bucket {bucket} at {filepath}, Done!')

copy_to_s3(client=s3, df=df_to_upload, bucket='abc', filepath='def/test.csv')

Answer 8

I found a very simple solution that seems to work:

s3 = boto3.client("s3")

s3.put_object(
    Body=open("filename.csv").read(),
    Bucket="your-bucket",
    Key="your-key"
)

Hope that helps!


Answer 9

I read a CSV with two columns from an S3 bucket and put the contents of the CSV file into a pandas dataframe.

Example:

config.json

{
  "credential": {
    "access_key":"xxxxxx",
    "secret_key":"xxxxxx"
}
,
"s3":{
       "bucket":"mybucket",
       "key":"csv/user.csv"
   }
}

cls_config.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import json

class cls_config(object):

    def __init__(self,filename):

        self.filename = filename


    def getConfig(self):

        fileName = os.path.join(os.path.dirname(__file__), self.filename)
        with open(fileName) as f:
            config = json.load(f)
        return config

cls_pandas.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd
import io

class cls_pandas(object):

    def __init__(self):
        pass

    def read(self,stream):

        df = pd.read_csv(io.StringIO(stream), sep = ",")
        return df

cls_s3.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import boto3
import json

class cls_s3(object):

    def  __init__(self,access_key,secret_key):

        self.s3 = boto3.client('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key)

    def getObject(self,bucket,key):

        read_file = self.s3.get_object(Bucket=bucket, Key=key)
        body = read_file['Body'].read().decode('utf-8')
        return body

test.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from cls_config import *
from cls_s3 import *
from cls_pandas import *

class test(object):

    def __init__(self):
        self.conf = cls_config('config.json')

    def process(self):

        conf = self.conf.getConfig()

        bucket = conf['s3']['bucket']
        key = conf['s3']['key']

        access_key = conf['credential']['access_key']
        secret_key = conf['credential']['secret_key']

        s3 = cls_s3(access_key,secret_key)
        ob = s3.getObject(bucket,key)

        pa = cls_pandas()
        df = pa.read(ob)

        print(df)

if __name__ == '__main__':
    test = test()
    test.process()