Tag Archive: pandas

How to convert the index of a pandas DataFrame into a column?

Question: How to convert the index of a pandas DataFrame into a column?

This seems rather obvious, but I can't seem to figure out how to convert the index of a data frame into a column.

For example:

df=
        gi       ptt_loc
 0  384444683      593  
 1  384444684      594 
 2  384444686      596  

To,

df=
    index1    gi       ptt_loc
 0  0     384444683      593  
 1  1     384444684      594 
 2  2     384444686      596  

Answer 0

either:

df['index1'] = df.index

or, .reset_index:

df.reset_index(level=0, inplace=True)

so, if you have a multi-index frame with 3 levels of index, like:

>>> df
                       val
tick       tag obs        
2016-02-26 C   2    0.0139
2016-02-27 A   2    0.5577
2016-02-28 C   6    0.0303

and you want to convert the 1st (tick) and 3rd (obs) levels in the index into columns, you would do:

>>> df.reset_index(level=['tick', 'obs'])
          tick  obs     val
tag                        
C   2016-02-26    2  0.0139
A   2016-02-27    2  0.5577
C   2016-02-28    6  0.0303
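
Since the question asks for a column named index1, here is a minimal sketch (using the single-level frame from the question) that combines reset_index with a rename of the generated column:

import pandas as pd

df = pd.DataFrame({'gi': [384444683, 384444684, 384444686],
                   'ptt_loc': [593, 594, 596]})

# reset_index() turns the unnamed index into a column called 'index';
# rename it to the desired label afterwards
out = df.reset_index().rename(columns={'index': 'index1'})
print(out)
#    index1         gi  ptt_loc
# 0       0  384444683      593
# 1       1  384444684      594
# 2       2  384444686      596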

Answer 1

For MultiIndex you can extract its subindex using

df['si_name'] = df.index.get_level_values('si_name')

where si_name is the name of the subindex.
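
A minimal self-contained sketch of this (the frame and the level name si_name below are made up for illustration):

import pandas as pd

idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)],
                                names=['si_name', 'obs'])
df = pd.DataFrame({'val': [10, 20, 30]}, index=idx)

# copy the 'si_name' level of the MultiIndex into a regular column
df['si_name'] = df.index.get_level_values('si_name')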


Answer 2

To provide a bit more clarity, let’s look at a DataFrame with two levels in its index (a MultiIndex).

import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['TX', 'FL', 'CA'], 
                                    ['North', 'South']], 
                                   names=['State', 'Direction'])

df = pd.DataFrame(index=index, 
                  data=np.random.randint(0, 10, (6,4)), 
                  columns=list('abcd'))

The reset_index method, called with the default parameters, converts all index levels to columns and uses a simple RangeIndex as new index.

df.reset_index()

Use the level parameter to control which index levels are converted into columns. If possible, use the level name, which is more explicit. If there are no level names, you can refer to each level by its integer location, which begins at 0 from the outside. You can use a scalar value here or a list of all the levels you would like to reset.

df.reset_index(level='State') # same as df.reset_index(level=0)

In the rare event that you want to preserve the index and turn the index into a column, you can do the following:

# for a single level
df.assign(State=df.index.get_level_values('State'))

# for all levels
df.assign(**df.index.to_frame())

Answer 3

rename_axis + reset_index

You can first rename your index to a desired label, then elevate it to a column:

df = df.rename_axis('index1').reset_index()

print(df)

   index1         gi  ptt_loc
0       0  384444683      593
1       1  384444684      594
2       2  384444686      596

This works also for MultiIndex dataframes:

print(df)
#                        val
# tick       tag obs        
# 2016-02-26 C   2    0.0139
# 2016-02-27 A   2    0.5577
# 2016-02-28 C   6    0.0303

df = df.rename_axis(['index1', 'index2', 'index3']).reset_index()

print(df)

       index1 index2  index3     val
0  2016-02-26      C       2  0.0139
1  2016-02-27      A       2  0.5577
2  2016-02-28      C       6  0.0303

Answer 4

If you want to use the reset_index method and also preserve your existing index you should use:

df.reset_index().set_index('index', drop=False)

or to change it in place:

df.reset_index(inplace=True)
df.set_index('index', drop=False, inplace=True)

For example:

print(df)
          gi  ptt_loc
0  384444683      593
4  384444684      594
9  384444686      596

print(df.reset_index())
   index         gi  ptt_loc
0      0  384444683      593
1      4  384444684      594
2      9  384444686      596

print(df.reset_index().set_index('index', drop=False))
       index         gi  ptt_loc
index
0          0  384444683      593
4          4  384444684      594
9          9  384444686      596

And if you want to get rid of the index label you can do:

df2 = df.reset_index().set_index('index', drop=False)
df2.index.name = None
print(df2)
   index         gi  ptt_loc
0      0  384444683      593
4      4  384444684      594
9      9  384444686      596

Answer 5

import pandas as pd

df1 = pd.DataFrame({"gi": [232, 66, 34, 43], "ptt": [342, 56, 662, 123]})
p = df1.index.values
df1.insert(0, column="new", value=p)
df1

    new     gi     ptt
0    0      232    342
1    1      66     56 
2    2      34     662
3    3      43     123

Answer 6

A very simple way of doing this is to use the reset_index() method. For a DataFrame df, use the code below:

df.reset_index(inplace=True)

This way, the index becomes a column, and by using inplace=True, the change is made permanently.
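
As a small sketch of the result (data made up for illustration): the former index shows up as a new column named 'index' when the index had no name, or under the index's own name otherwise.

import pandas as pd

df = pd.DataFrame({'gi': [384444683, 384444684]}, index=[10, 20])
df.reset_index(inplace=True)
print(df)
#    index         gi
# 0     10  384444683
# 1     20  384444684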


Select by partial string from a pandas DataFrame

Question: Select by partial string from a pandas DataFrame

I have a DataFrame with 4 columns of which 2 contain string values. I was wondering if there was a way to select rows based on a partial string match against a particular column?

In other words, a function or lambda function that would do something like

re.search(pattern, cell_in_question) 

returning a boolean. I am familiar with the syntax of df[df['A'] == "hello world"] but can’t seem to find a way to do the same with a partial string match say 'hello'.

Would someone be able to point me in the right direction?


Answer 0

Based on github issue #620, it looks like you’ll soon be able to do the following:

df[df['A'].str.contains("hello")]

Update: vectorized string methods (i.e., Series.str) are available in pandas 0.8.1 and up.


Answer 1

I tried the proposed solution above:

df[df["A"].str.contains("Hello|Britain")]

and got an error:

ValueError: cannot mask with array containing NA / NaN values

you can transform NA values into False, like this:

df[df["A"].str.contains("Hello|Britain", na=False)]

Answer 2

How do I select by partial string from a pandas DataFrame?

This post is meant for readers who want to

  • search for a substring in a string column (the simplest case)
  • search for multiple substrings (similar to isin)
  • match a whole word from text (e.g., “blue” should match “the sky is blue” but not “bluejay”)
  • match multiple whole words
  • Understand the reason behind “ValueError: cannot index with vector containing NA / NaN values”

…and would like to know more about what methods should be preferred over others.

(P.S.: I’ve seen a lot of questions on similar topics, I thought it would be good to leave this here.)


Basic Substring Search

# setup
df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
df1

      col
0     foo
1  foobar
2     bar
3     baz

str.contains can be used to perform either substring searches or regex based search. The search defaults to regex-based unless you explicitly disable it.

Here is an example of regex-based search,

# find rows in `df1` which contain "foo" followed by something
df1[df1['col'].str.contains(r'foo(?!$)')]

      col
1  foobar

Sometimes regex search is not required, so specify regex=False to disable it.

#select all rows containing "foo"
df1[df1['col'].str.contains('foo', regex=False)]
# same as df1[df1['col'].str.contains('foo')] but faster.

      col
0     foo
1  foobar

Performance wise, regex search is slower than substring search:

df2 = pd.concat([df1] * 1000, ignore_index=True)

%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]

6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Avoid using regex-based search if you don’t need it.

Addressing ValueErrors
Sometimes, performing a substring search and filtering on the result will result in

ValueError: cannot index with vector containing NA / NaN values

This is usually because of mixed data or NaNs in your object column,

s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])
s.str.contains('foo|bar')

0     True
1     True
2      NaN
3     True
4    False
5      NaN
dtype: object


s[s.str.contains('foo|bar')]
# ---------------------------------------------------------------------------
# ValueError                                Traceback (most recent call last)

Anything that is not a string cannot have string methods applied on it, so the result is NaN (naturally). In this case, specify na=False to ignore non-string data,

s.str.contains('foo|bar', na=False)

0     True
1     True
2    False
3     True
4    False
5    False
dtype: bool

Multiple Substring Search

This is most easily achieved through a regex search using the regex OR pipe.

# Slightly modified example.
df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
df4

          col
0     foo abc
1  foobar xyz
2       bar32
3      baz 45

df4[df4['col'].str.contains(r'foo|baz')]

          col
0     foo abc
1  foobar xyz
3      baz 45

You can also create a list of terms, then join them:

terms = ['foo', 'baz']
df4[df4['col'].str.contains('|'.join(terms))]

          col
0     foo abc
1  foobar xyz
3      baz 45

Sometimes, it is wise to escape your terms in case they have characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters…

. ^ $ * + ? { } [ ] \ | ( )

Then, you’ll need to use re.escape to escape them:

import re
df4[df4['col'].str.contains('|'.join(map(re.escape, terms)))]

          col
0     foo abc
1  foobar xyz
3      baz 45

re.escape has the effect of escaping the special characters so they’re treated literally.

re.escape(r'.foo^')
# '\\.foo\\^'

Matching Entire Word(s)

By default, the substring search searches for the specified substring/pattern regardless of whether it is full word or not. To only match full words, we will need to make use of regular expressions here—in particular, our pattern will need to specify word boundaries (\b).

For example,

df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})
df3

                     col
0        the sky is blue
1  bluejay by the window

Now consider,

df3[df3['col'].str.contains('blue')]

                     col
0        the sky is blue
1  bluejay by the window

v/s

df3[df3['col'].str.contains(r'\bblue\b')]

               col
0  the sky is blue

Multiple Whole Word Search

Similar to the above, except we add a word boundary (\b) to the joined pattern.

p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df4[df4['col'].str.contains(p)]

       col
0  foo abc
3   baz 45

Where p looks like this,

p
# '\\b(?:foo|baz)\\b'

A Great Alternative: Use List Comprehensions!

Because you can! And you should! They are usually a little bit faster than string methods, because string methods are hard to vectorise and usually have loopy implementations.

Instead of,

df1[df1['col'].str.contains('foo', regex=False)]

Use the in operator inside a list comp,

df1[['foo' in x for x in df1['col']]]

      col
0     foo
1  foobar

Instead of,

regex_pattern = r'foo(?!$)'
df1[df1['col'].str.contains(regex_pattern)]

Use re.compile (to cache your regex) + Pattern.search inside a list comp,

p = re.compile(regex_pattern, flags=re.IGNORECASE)
df1[[bool(p.search(x)) for x in df1['col']]]

      col
1  foobar

If “col” has NaNs, then instead of

df1[df1['col'].str.contains(regex_pattern, na=False)]

Use,

def try_search(p, x):
    try:
        return bool(p.search(x))
    except TypeError:
        return False

p = re.compile(regex_pattern)
df1[[try_search(p, x) for x in df1['col']]]

      col
1  foobar

More Options for Partial String Matching: np.char.find, np.vectorize, DataFrame.query.

In addition to str.contains and list comprehensions, you can also use the following alternatives.

np.char.find
Supports substring searches (read: no regex) only.

df4[np.char.find(df4['col'].values.astype(str), 'foo') > -1]

          col
0     foo abc
1  foobar xyz

np.vectorize
This is a wrapper around a loop, but with lesser overhead than most pandas str methods.

f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df1['col'], 'foo')
# array([ True,  True, False, False])

df1[f(df1['col'], 'foo')]

      col
0     foo
1  foobar

Regex solutions possible:

regex_pattern = r'foo(?!$)'
p = re.compile(regex_pattern)
f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
df1[f(df1['col'])]

      col
1  foobar

DataFrame.query
Supports string methods through the python engine. This offers no visible performance benefits, but is nonetheless useful to know if you need to dynamically generate your queries.

df1.query('col.str.contains("foo")', engine='python')

      col
0     foo
1  foobar

More information on query and eval family of methods can be found at Dynamic Expression Evaluation in pandas using pd.eval().


Recommended Usage Precedence

  1. (First) str.contains, for its simplicity and its ease of handling NaNs and mixed data
  2. List comprehensions, for their performance (especially if your data is purely strings)
  3. np.vectorize
  4. (Last) df.query

Answer 3

If anyone wonders how to perform a related problem: “Select column by partial string”

Use:

df.filter(like='hello')  # select columns which contain the word hello

And to select rows by partial string matching, pass axis=0 to filter:

# selects rows which contain the word hello in their index label
df.filter(like='hello', axis=0)  

Answer 4

Quick note: if you want to do selection based on a partial string contained in the index, try the following:

df['stridx']=df.index
df[df['stridx'].str.contains("Hello|Britain")]
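
As an alternative sketch (assuming the index holds strings), you can also call the string methods on the index directly, without creating a helper column:

# keep rows whose index label contains "Hello" or "Britain"
df[df.index.str.contains("Hello|Britain")]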

Answer 5

Say you have the following DataFrame:

>>> df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']], columns=['a','b'])
>>> df
       a            b
0  hello  hello world
1   abcd         defg

You can always use the in operator in a lambda expression to create your filter.

>>> df.apply(lambda x: x['a'] in x['b'], axis=1)
0     True
1    False
dtype: bool

The trick here is to use the axis=1 option in the apply to pass elements to the lambda function row by row, as opposed to column by column.


Answer 6

Here’s what I ended up doing for partial string matches. If anyone has a more efficient way of doing this please let me know.

import re
import pandas as pd

def stringSearchColumn_DataFrame(df, colName, regex):
    newdf = pd.DataFrame()
    for idx, record in df[colName].items():

        if re.search(regex, record):
            newdf = pd.concat([df[df[colName] == record], newdf], ignore_index=True)

    return newdf
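
A hypothetical usage sketch, reusing the df and the column name 'A' from the question:

matches = stringSearchColumn_DataFrame(df, 'A', r'hello')
print(matches)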

Answer 7

Using contains didn’t work well for my string with special characters. Find worked though.

df[df['A'].str.find("hello") != -1]

Answer 8

There are answers before this which accomplish the requested feature; anyway, I would like to show the most general way:

df.filter(regex=".*STRING_YOU_LOOK_FOR.*")

This way you get the column you are looking for, however it is written.

(Obviously, you have to write the proper regex expression for each case.)


Answer 9

Maybe you want to search for some text in all columns of the Pandas dataframe, and not just in the subset of them. In this case, the following code will help.

df[df.apply(lambda row: row.astype(str).str.contains('String To Find').any(), axis=1)]

Warning. This method is relatively slow, albeit convenient.


Answer 10

Should you need to do a case insensitive search for a string in a pandas dataframe column:

df[df['A'].str.contains("hello", case=False)]
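
Equivalently, a small sketch passing a regex flag instead (str.contains also accepts flags):

import re
df[df['A'].str.contains("hello", flags=re.IGNORECASE)]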

Get statistics for each group (such as count, mean, etc.) using pandas GroupBy?

Question: Get statistics for each group (such as count, mean, etc.) using pandas GroupBy?

I have a data frame df and I use several columns from it to groupby:

df['col1','col2','col3','col4'].groupby(['col1','col2']).mean()

In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains the number of rows in each group. In other words, I have the mean, but I would also like to know how many values were used to get these means. For example, in the first group there are 8 values and in the second one 10, and so on.

In short: How do I get group-wise statistics for a dataframe?


Answer 0

On the groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
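
If you would rather have flat column names than the resulting MultiIndex columns, here is a sketch using named aggregation (available in pandas 0.25+; the output names below are arbitrary):

df.groupby(['col1', 'col2']).agg(
    col3_mean=('col3', 'mean'),
    col4_mean=('col4', 'mean'),
    n_rows=('col4', 'count'),
).reset_index()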

Answer 1

Quick Answer:

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(['col1','col2']).size()


Usually you want this result as a DataFrame (instead of a Series) so you can do:

df.groupby(['col1', 'col2']).size().reset_index(name='counts')


If you want to find out how to calculate the row counts and other statistics for each group continue reading below.


Detailed example:

Consider the following example dataframe:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let’s use .size() to get the row counts:

In [3]: df.groupby(['col1', 'col2']).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let’s use .size().reset_index(name='counts') to get the row counts:

In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1


Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
   ...: .groupby(['col1', 'col2'])
   ...: .agg({
   ...:     'col3': ['mean', 'count'], 
   ...:     'col4': ['median', 'min', 'count']
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(['col1', 'col2'])
   ...: counts = gb.size().to_frame(name='counts')
   ...: (counts
   ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
   ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
   ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63



Footnotes

The code used to generate the test data is shown below:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['E', 'F'],
   ...:         ['E', 'F'],
   ...:         ['G', 'H'] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
   ...: )
   ...: 
   ...: df[['col3', 'col4', 'col5', 'col6']] = \
   ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
   ...: 


Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.
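
To illustrate that point, here is a small sketch (made-up data) showing how count ignores NaN while size counts every row:

import numpy as np
import pandas as pd

d = pd.DataFrame({'key': ['A', 'A', 'B'], 'val': [1.0, np.nan, 2.0]})
g = d.groupby('key')['val']
print(g.count())  # A: 1, B: 1 -> NaN rows are dropped
print(g.size())   # A: 2, B: 1 -> every row is counted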


Answer 2

One Function to Rule Them All: GroupBy.describe

Returns count, mean, std, and other useful statistics per-group.

df.groupby(['col1', 'col2'])[['col3', 'col4']].describe()

# Setup
np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

from IPython.display import display

with pd.option_context('precision', 2):
    display(df.groupby(['A', 'B'])['C'].describe())

           count  mean   std   min   25%   50%   75%   max
A   B                                                     
bar one      1.0  0.40   NaN  0.40  0.40  0.40  0.40  0.40
    three    1.0  2.24   NaN  2.24  2.24  2.24  2.24  2.24
    two      1.0 -0.98   NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one      2.0  1.36  0.58  0.95  1.15  1.36  1.56  1.76
    three    1.0 -0.15   NaN -0.15 -0.15 -0.15 -0.15 -0.15
    two      2.0  1.42  0.63  0.98  1.20  1.42  1.65  1.87

To get specific statistics, just select them,

df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]

           count      mean
A   B                     
bar one      1.0  0.400157
    three    1.0  2.240893
    two      1.0 -0.977278
foo one      2.0  1.357070
    three    1.0 -0.151357
    two      2.0  1.423148

describe works for multiple columns (change ['C'] to ['C', 'D'], or remove it altogether, and see what happens; the result is a DataFrame with MultiIndexed columns).

You also get different statistics for string data. Here’s an example,

df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)

with pd.option_context('precision', 2):
    display(df2.groupby(['A', 'B'])
               .describe(include='all')
               .dropna(how='all', axis=1))

              C                                                   D                
          count  mean       std   min   25%   50%   75%   max count unique top freq
A   B                                                                              
bar one    14.0  0.40  5.76e-17  0.40  0.40  0.40  0.40  0.40    14      1   a   14
    three  14.0  2.24  4.61e-16  2.24  2.24  2.24  2.24  2.24    14      1   b   14
    two     9.0 -0.98  0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98     9      1   c    9
foo one    22.0  1.43  4.10e-01  0.95  0.95  1.76  1.76  1.76    22      2   a   13
    three  15.0 -0.15  0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15    15      1   c   15
    two    26.0  1.49  4.48e-01  0.98  0.98  1.87  1.87  1.87    26      2   b   15

For more information, see the documentation.


Answer 3

We can easily do it by using groupby and count. But, we should remember to use reset_index().

df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()

Answer 4

To get multiple stats, collapse the index, and retain column names:

df = df.groupby(['col1','col2']).agg(['mean', 'count'])
df.columns = [ ' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
df

This produces a flat DataFrame whose column names combine the original column and the aggregation, e.g. 'col3 mean' and 'col3 count', alongside the restored col1 and col2 columns.


Answer 5

Create a groupby object and call methods on it, as in the example below:

grp = df.groupby(['col1',  'col2',  'col3']) 

grp.max() 
grp.mean() 
grp.describe() 

Answer 6

Please try this code:

# transform('count') keeps the per-group count aligned with the original rows
new_column = df.groupby(['col1', 'col2'])['col3'].transform('count')
df['count_it'] = new_column
df

I think this code will add a column called 'count_it' containing the row count of each group.


Shuffle DataFrame rows

Question: Shuffle DataFrame rows

I have the following DataFrame:

    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
...
20     7     8     9     2
21    10    11    12     2
...
45    13    14    15     3
46    16    17    18     3
...

The DataFrame is read from a csv file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.

I would like to shuffle the order of the DataFrame's rows, so that all the Type values are mixed. A possible result could be:

    Col1  Col2  Col3  Type
0      7     8     9     2
1     13    14    15     3
...
20     1     2     3     1
21    10    11    12     2
...
45     4     5     6     1
46    16    17    18     3
...

How can I achieve this?


Answer 0

The idiomatic way to do this with Pandas is to use the .sample method of your dataframe to sample all rows without replacement:

df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).


Note: If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.

df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.

Follow-up note: Although it may not look like the above operation is in-place, python/pandas is smart enough not to do another malloc for the shuffled object. That is, even though the reference object has changed (by which I mean id(df_old) is not the same as id(df_new)), the underlying C object is still the same. To show that this is indeed the case, you could run a simple memory profiler:

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)


Answer 1

You can simply use sklearn for this

from sklearn.utils import shuffle
df = shuffle(df)
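
If you need the shuffle to be reproducible, sklearn's shuffle also accepts a random_state; a minimal sketch:

from sklearn.utils import shuffle
df = shuffle(df, random_state=42)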

Answer 2

You can shuffle the rows of a dataframe by indexing with a shuffled index. For this, you can e.g. use np.random.permutation (but np.random.choice is also a possibility):

In [12]: df = pd.read_csv(StringIO(s), sep="\s+")

In [13]: df
Out[13]: 
    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
20     7     8     9     2
21    10    11    12     2
45    13    14    15     3
46    16    17    18     3

In [14]: df.iloc[np.random.permutation(len(df))]
Out[14]: 
    Col1  Col2  Col3  Type
46    16    17    18     3
45    13    14    15     3
20     7     8     9     2
0      1     2     3     1
1      4     5     6     1
21    10    11    12     2

If you want to keep the index numbered from 1, 2, .., n as in your example, you can simply reset the index: df_shuffled.reset_index(drop=True)


Answer 3

TL;DR: np.random.shuffle(ndarray) can do the job.
So, in your case

np.random.shuffle(DataFrame.values)

DataFrame, under the hood, uses a NumPy ndarray as its data holder. (You can check the DataFrame source code.)

So if you use np.random.shuffle(), it shuffles the array along the first axis of a multi-dimensional array, but the index of the DataFrame remains unshuffled.

Though, there are some points to consider.

  • The function returns None. If you want to keep a copy of the original object, you have to make one before passing it to the function.
  • sklearn.utils.shuffle(), as user tj89 suggested, lets you specify random_state along with other options to control the output. You may want that for development purposes.
  • sklearn.utils.shuffle() is faster, but it WILL SHUFFLE the axis info (index, columns) of the DataFrame along with the ndarray it contains.

Benchmark result

between sklearn.utils.shuffle() and np.random.shuffle():

ndarray

  • nd = sklearn.utils.shuffle(nd): 0.10793248389381915 sec (8x faster)
  • np.random.shuffle(nd): 0.8897626010002568 sec

DataFrame

  • df = sklearn.utils.shuffle(df): 0.3183923360193148 sec (3x faster)
  • np.random.shuffle(df.values): 0.9357550159329548 sec

Conclusion: If it is okay for the axis info (index, columns) to be shuffled along with the ndarray, use sklearn.utils.shuffle(). Otherwise, use np.random.shuffle().

Code used:

import timeit
setup = '''
import numpy as np
import pandas as pd
import sklearn
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''

timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000)


Answer 4

(I don’t have enough reputation to comment this on the top post, so I hope someone else can do that for me.) There was a concern raised that the first method:

df.sample(frac=1)

made a deep copy or just changed the dataframe. I ran the following code:

print(hex(id(df)))
print(hex(id(df.sample(frac=1))))
print(hex(id(df.sample(frac=1).reset_index(drop=True))))

and my results were:

0x1f8a784d400
0x1f8b9d65e10
0x1f8b9d65b70

which means the method is not returning the same object, as was suggested in the last comment. So this method does indeed make a shuffled copy.


Answer 5

What is also useful: if you use it for machine learning and always want to separate out the same data, you could use:

df.sample(n=len(df), random_state=42)

This makes sure that your random selection is always reproducible.


Answer 6

AFAIK the simplest solution is:

df_shuffled = df.reindex(np.random.permutation(df.index))

Answer 7

Shuffle the pandas DataFrame by taking a sample array, in this case the index, randomizing its order, and then setting the array as the index of the DataFrame. Now sort the DataFrame according to the index. Here goes your shuffled DataFrame:

import random
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
index = [i for i in range(df.shape[0])]
random.shuffle(index)
df.set_index([index]).sort_index()

output

    a   b
0   2   6
1   1   5
2   3   7
3   4   8

Insert your DataFrame in the place of mine in the code above.


Answer 8

Here is another way:

df['rnd'] = np.random.rand(len(df))
df = df.sort_values(by='rnd').drop('rnd', axis=1)


How to filter a pandas DataFrame using 'in' and 'not in' like in SQL

Question: How to filter a pandas DataFrame using 'in' and 'not in' like in SQL

How can I achieve the equivalents of SQL’s IN and NOT IN?

I have a list with the required values. Here’s the scenario:

df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = ['UK','China']

# pseudo-code:
df[df['countries'] not in countries]

My current way of doing this is as follows:

df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = pd.DataFrame({'countries':['UK','China'], 'matched':True})

# IN
df.merge(countries,how='inner',on='countries')

# NOT IN
not_in = df.merge(countries,how='left',on='countries')
not_in = not_in[pd.isnull(not_in['matched'])]

But this seems like a horrible kludge. Can anyone improve on it?


Answer 0

You can use pd.Series.isin.

For “IN” use: something.isin(somewhere)

Or for “NOT IN”: ~something.isin(somewhere)

As a worked example:

>>> df
  countries
0        US
1        UK
2   Germany
3     China
>>> countries
['UK', 'China']
>>> df.countries.isin(countries)
0    False
1     True
2    False
3     True
Name: countries, dtype: bool
>>> df[df.countries.isin(countries)]
  countries
1        UK
3     China
>>> df[~df.countries.isin(countries)]
  countries
0        US
2   Germany

Answer 1

Alternative solution that uses .query() method:

In [5]: df.query("countries in @countries")
Out[5]:
  countries
1        UK
3     China

In [6]: df.query("countries not in @countries")
Out[6]:
  countries
0        US
2   Germany

Answer 2

How to implement ‘in’ and ‘not in’ for a pandas DataFrame?

Pandas offers two methods: Series.isin and DataFrame.isin for Series and DataFrames, respectively.


Filter DataFrame Based on ONE Column (also applies to Series)

The most common scenario is applying an isin condition on a specific column to filter rows in a DataFrame.

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', np.nan, 'China']})
df
  countries
0        US
1        UK
2   Germany
3       NaN
4     China

c1 = ['UK', 'China']             # list
c2 = {'Germany'}                 # set
c3 = pd.Series(['China', 'US'])  # Series
c4 = np.array(['US', 'UK'])      # array

Series.isin accepts various types as inputs. The following are all valid ways of getting what you want:

df['countries'].isin(c1)

0    False
1     True
2    False
3    False
4     True
Name: countries, dtype: bool

# `in` operation
df[df['countries'].isin(c1)]

  countries
1        UK
4     China

# `not in` operation
df[~df['countries'].isin(c1)]

  countries
0        US
2   Germany
3       NaN

# Filter with `set` (tuples work too)
df[df['countries'].isin(c2)]

  countries
2   Germany

# Filter with another Series
df[df['countries'].isin(c3)]

  countries
0        US
4     China

# Filter with array
df[df['countries'].isin(c4)]

  countries
0        US
1        UK

Filter on MANY Columns

Sometimes, you will want to apply an ‘in’ membership check with some search terms over multiple columns,

df2 = pd.DataFrame({
    'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})
df2

   A    B  C
0  x    w  0
1  y    a  1
2  z  NaN  2
3  q    x  3

c1 = ['x', 'w', 'p']

To apply the isin condition to both columns “A” and “B”, use DataFrame.isin:

df2[['A', 'B']].isin(c1)

      A      B
0   True   True
1  False  False
2  False  False
3  False   True

From this, to retain rows where at least one column is True, we can use any along the first axis:

df2[['A', 'B']].isin(c1).any(axis=1)

0     True
1    False
2    False
3     True
dtype: bool

df2[df2[['A', 'B']].isin(c1).any(axis=1)]

   A  B  C
0  x  w  0
3  q  x  3

Note that if you want to search every column, you’d just omit the column selection step and do

df2.isin(c1).any(axis=1)

Similarly, to retain rows where ALL columns are True, use all in the same manner as before.

df2[df2[['A', 'B']].isin(c1).all(axis=1)]

   A  B  C
0  x  w  0

Notable Mentions: numpy.isin, query, list comprehensions (string data)

In addition to the methods described above, you can also use the numpy equivalent: numpy.isin.

# `in` operation
df[np.isin(df['countries'], c1)]

  countries
1        UK
4     China

# `not in` operation
df[np.isin(df['countries'], c1, invert=True)]

  countries
0        US
2   Germany
3       NaN

Why is it worth considering? NumPy functions are usually a bit faster than their pandas equivalents because of lower overhead. Since this is an elementwise operation that does not depend on index alignment, there are very few situations where this method is not an appropriate replacement for pandas’ isin.

Pandas routines are usually iterative when working with strings, because string operations are hard to vectorise. There is a lot of evidence to suggest that list comprehensions will be faster here. We resort to an in check now.

c1_set = set(c1) # Using `in` with `sets` is a constant time operation... 
                 # This doesn't matter for pandas because the implementation differs.
# `in` operation
df[[x in c1_set for x in df['countries']]]

  countries
1        UK
4     China

# `not in` operation
df[[x not in c1_set for x in df['countries']]]

  countries
0        US
2   Germany
3       NaN

It is a lot more unwieldy to specify, however, so don’t use it unless you know what you’re doing.

Lastly, there’s also DataFrame.query which has been covered in this answer. numexpr FTW!
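
As a rough sketch of that last point, here is a minimal, hypothetical example of the same membership tests expressed through query; the @ prefix refers to a variable in the enclosing Python scope:

import pandas as pd

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', 'China']})
c1 = ['UK', 'China']

df.query("countries in @c1")       # `in`: keep rows whose value appears in c1
df.query("countries not in @c1")   # `not in`: keep the complement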


回答 3

我通常对这样的行进行通用过滤:

criterion = lambda row: row['countries'] not in countries
not_in = df[df.apply(criterion, axis=1)]

I’ve been usually doing generic filtering over rows like this:

criterion = lambda row: row['countries'] not in countries
not_in = df[df.apply(criterion, axis=1)]

回答 4

我想过滤出dfbc行,该行的BUSINESS_ID也在dfProfilesBusIds的BUSINESS_ID中

dfbc = dfbc[~dfbc['BUSINESS_ID'].isin(dfProfilesBusIds['BUSINESS_ID'])]

I wanted to filter out dfbc rows that had a BUSINESS_ID that was also in the BUSINESS_ID of dfProfilesBusIds

dfbc = dfbc[~dfbc['BUSINESS_ID'].isin(dfProfilesBusIds['BUSINESS_ID'])]
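
A self-contained sketch of the same anti-join pattern, with small hypothetical frames standing in for dfbc and dfProfilesBusIds:

import pandas as pd

# hypothetical stand-ins for the two frames mentioned above
dfbc = pd.DataFrame({'BUSINESS_ID': [1, 2, 3, 4], 'value': list('abcd')})
dfProfilesBusIds = pd.DataFrame({'BUSINESS_ID': [2, 4]})

# keep only the rows whose BUSINESS_ID does NOT appear in the other frame
dfbc = dfbc[~dfbc['BUSINESS_ID'].isin(dfProfilesBusIds['BUSINESS_ID'])]
print(dfbc)   # rows with BUSINESS_ID 1 and 3 remain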

回答 5

从答案中整理可能的解决方案:

对于IN: df[df['A'].isin([3, 6])]

对于NOT IN:

  1. df[-df["A"].isin([3, 6])]

  2. df[~df["A"].isin([3, 6])]

  3. df[df["A"].isin([3, 6]) == False]

  4. df[np.logical_not(df["A"].isin([3, 6]))]

Collating possible solutions from the answers:

For IN: df[df['A'].isin([3, 6])]

For NOT IN:

  1. df[-df["A"].isin([3, 6])]

  2. df[~df["A"].isin([3, 6])]

  3. df[df["A"].isin([3, 6]) == False]

  4. df[np.logical_not(df["A"].isin([3, 6]))]
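
These spellings all build the same boolean mask; here is a quick sketch with a throwaway frame (column A and its values are only for illustration; the unary-minus form in option 1 behaves like ~ on older pandas versions):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 5, 6]})

mask = ~df['A'].isin([3, 6])   # option 2, usually the most readable spelling

# options 3 and 4 produce the same mask
assert mask.equals(df['A'].isin([3, 6]) == False)
assert mask.equals(np.logical_not(df['A'].isin([3, 6])))

df[mask]   # keeps the rows where A is 1 or 5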


回答 6

df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = ['UK','China']

实现 in：

df[df.countries.isin(countries)]

通过对其余国家使用 in 来实现 not in：

df[df.countries.isin([x for x in np.unique(df.countries) if x not in countries])]
df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = ['UK','China']

implement in:

df[df.countries.isin(countries)]

implement not in as in of rest countries:

df[df.countries.isin([x for x in np.unique(df.countries) if x not in countries])]

使用Python在Pandas中读取CSV文件时出现UnicodeDecodeError

问题:使用Python在Pandas中读取CSV文件时出现UnicodeDecodeError

我正在运行一个处理 30,000 个类似文件的程序。其中随机数量的文件会中途停止并产生以下错误…

   File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
     data = pd.read_csv(filepath, names=fields)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
     return _read(filepath_or_buffer, kwds)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
     return parser.read()
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
     ret = self._engine.read(nrows)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
     data = self._reader.read(nrows)
   File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
   File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
   File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
   File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
   File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
   File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
   File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
   File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid    continuation byte

这些文件的源/创建都来自同一位置。纠正此错误以继续导入的最佳方法是什么?

I’m running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error…

   File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
     data = pd.read_csv(filepath, names=fields)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
     return _read(filepath_or_buffer, kwds)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
     return parser.read()
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
     ret = self._engine.read(nrows)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
     data = self._reader.read(nrows)
   File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
   File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
   File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
   File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
   File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
   File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
   File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
   File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid    continuation byte

The source/creation of these files all come from the same place. What’s the best way to correct this to proceed with the import?


回答 0

read_csv 提供 encoding 选项来处理不同编码格式的文件。我读取时主要使用 read_csv('file', encoding = "ISO-8859-1")，或者改用 encoding = "utf-8"；而 to_csv 通常使用 utf-8。

您还可以使用若干别名选项之一，例如用 'latin' 代替 'ISO-8859-1'（请参阅 python docs，其中还列出了您可能遇到的许多其他编码）。

请参阅相关的 Pandas 文档、关于 csv 文件的 python 文档示例，以及 SO 上的大量相关问题。一个好的背景资料是《每个开发人员都应了解的 unicode 和字符集》。

要检测编码(假设文件包含非ASCII字符),可以使用enca(请参见手册页)或file -i(linux)或file -I(osx)(请参见手册页)。

read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding = "ISO-8859-1"), or alternatively encoding = "utf-8" for reading, and generally utf-8 for to_csv.

You can also use one of several alias options like 'latin' instead of 'ISO-8859-1' (see python docs, also for numerous other encodings you may encounter).

See relevant Pandas documentation, python docs examples on csv files, and plenty of related questions here on SO. A good background resource is What every developer should know about unicode and character sets.

To detect the encoding (assuming the file contains non-ascii characters), you can use enca (see man page) or file -i (linux) or file -I (osx) (see man page).
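
If you would rather stay in Python, the third-party chardet package (an assumption here, it is not part of pandas) can guess the encoding from a sample of the raw bytes before you call read_csv:

import chardet
import pandas as pd

with open('file.csv', 'rb') as f:
    guess = chardet.detect(f.read(100000))   # inspect a sample of the raw bytes

print(guess)   # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
df = pd.read_csv('file.csv', encoding=guess['encoding'])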


回答 1

所有解决方案中最简单的:

import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')

替代解决方案:

  • 在 Sublime 文本编辑器中打开 csv 文件。
  • 以utf-8格式保存文件。

在 Sublime 中，单击 文件 -> 使用编码保存 -> UTF-8

然后,您可以照常读取文件:

import pandas as pd
data = pd.read_csv('file_name.csv', encoding='utf-8')

其他不同的编码类型是:

encoding = "cp1252"
encoding = "ISO-8859-1"

Simplest of all Solutions:

import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')

Alternate Solution:

  • Open the csv file in Sublime text editor.
  • Save the file in utf-8 format.

In sublime, Click File -> Save with encoding -> UTF-8

Then, you can read your file as usual:

import pandas as pd
data = pd.read_csv('file_name.csv', encoding='utf-8')

and the other different encoding types are:

encoding = "cp1252"
encoding = "ISO-8859-1"

回答 2

Pandas 允许指定编码，但不允许忽略错误、自动替换有问题的字节。因此没有放之四海而皆准的方法，而要根据实际用例采用不同的方式。

  1. 您知道编码,并且文件中没有编码错误。太好了:您只需要指定编码即可:

    file_encoding = 'cp1252'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    pd.read_csv(input_file_and_path, ..., encoding=file_encoding)
  2. 您不希望被编码问题困扰,无论某些文本字段是否包含垃圾内容,都只希望加载该死的文件。好的,您只需要使用Latin1编码,因为它接受任何可能的字节作为输入(并将其转换为相同代码的unicode字符):

    pd.read_csv(input_file_and_path, ..., encoding='latin1')
  3. 您知道文件大部分是用特定编码写的，但其中也包含编码错误。一个真实的例子是：一个 UTF8 文件被非 utf8 编辑器编辑过，混入了若干使用不同编码的行。Pandas 没有提供专门的错误处理机制，但 Python 的 open 函数有（假设是 Python3），而且 read_csv 接受类文件对象。这里常用的 errors 参数是 'ignore'（直接丢弃有问题的字节），或者（依我看更好的）'backslashreplace'（用 Python 的反斜杠转义序列替换有问题的字节）：

    file_encoding = 'utf8'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    input_fd = open(input_file_and_path, encoding=file_encoding, errors = 'backslashreplace')
    pd.read_csv(input_fd, ...)

Pandas allows to specify encoding, but does not allow to ignore errors not to automatically replace the offending bytes. So there is no one size fits all method but different ways depending on the actual use case.

  1. You know the encoding, and there is no encoding error in the file. Great: you have just to specify the encoding:

    file_encoding = 'cp1252'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    pd.read_csv(input_file_and_path, ..., encoding=file_encoding)
    
  2. You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. Ok, you only have to use Latin1 encoding because it accept any possible byte as input (and convert it to the unicode character of same code):

    pd.read_csv(input_file_and_path, ..., encoding='latin1')
    
  3. You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real world example is an UTF8 file that has been edited with a non utf8 editor and which contains some lines with a different encoding. Pandas has no provision for a special error processing, but Python open function has (assuming Python3), and read_csv accepts a file like object. Typical errors parameter to use here are 'ignore' which just suppresses the offending bytes or (IMHO better) 'backslashreplace' which replaces the offending bytes by their Python’s backslashed escape sequence:

    file_encoding = 'utf8'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    input_fd = open(input_file_and_path, encoding=file_encoding, errors = 'backslashreplace')
    pd.read_csv(input_fd, ...)
    

回答 3

with open('filename.csv') as f:
   print(f)

执行此代码后，您将看到 'filename.csv' 的编码，然后执行如下代码

data=pd.read_csv('filename.csv', encoding="encoding as you found earlier")

这样就可以了

with open('filename.csv') as f:
   print(f)

after executing this code you will find encoding of ‘filename.csv’ then execute code as following

data=pd.read_csv('filename.csv', encoding="encoding as you found earlier")

there you go


回答 4

就我而言，根据 Notepad++ 的显示，文件的编码是 USC-2 LE BOM。在 python 中对应的是 encoding="utf_16_le"。

希望这有助于更快找到某人的答案。

In my case, a file has USC-2 LE BOM encoding, according to Notepad++. It is encoding="utf_16_le" for python.

Hope, it helps to find an answer a bit faster for someone.


回答 5

就我而言,这适用于python 2.7:

data = read_csv(filename, encoding = "ISO-8859-1", dtype={'name_of_colum': unicode}, low_memory=False) 

而对于python 3,仅:

data = read_csv(filename, encoding = "ISO-8859-1", low_memory=False) 

In my case this worked for python 2.7:

data = read_csv(filename, encoding = "ISO-8859-1", dtype={'name_of_colum': unicode}, low_memory=False) 

And for python 3, only:

data = read_csv(filename, encoding = "ISO-8859-1", low_memory=False) 

回答 6

尝试指定engine =’python’。它对我有用,但我仍在尝试找出原因。

df = pd.read_csv(input_file_path,...engine='python')

Try specifying the engine=’python’. It worked for me but I’m still trying to figure out why.

df = pd.read_csv(input_file_path,...engine='python')

回答 7

我发布这个答案，是为了提供一个更新的解决方案，并解释为什么会出现此问题。假设您的数据来自数据库或 Excel 工作簿。如果其中含有特殊字符，例如 La Cañada Flintridge city，那么除非导出数据时使用 UTF-8 编码，否则就会引入错误。La Cañada Flintridge city 会变成 La Ca\xf1ada Flintridge city。如果您在不调整默认参数的情况下使用 pandas.read_csv，就会遇到以下错误

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte

幸运的是,有一些解决方案。

选项1，修复导出环节。确保使用 UTF-8 编码导出数据。

选项2，如果您无法修复导出问题而必须使用 pandas.read_csv，请务必加上参数 engine='python'。缺省情况下 pandas 使用 engine='C'，它非常适合读取大而干净的文件，但一旦出现意外情况就会崩溃。根据我的经验，设置 encoding='utf-8' 从未解决过这个 UnicodeDecodeError。另外，您并不需要使用 errors_bad_lines；不过如果确实需要，它仍然是一个选项。

pd.read_csv(<your file>, engine='python')

选项3：这是我个人首选的解决方案，即使用原生 Python 读取文件。

import pandas as pd

data = []

with open(<your file>, "rb") as myfile:
    # read the header separately
    # decode it as 'utf-8', remove any special characters, and split it on the comma (or delimiter)
    header = myfile.readline().decode('utf-8').replace('\r\n', '').split(',')
    # read the rest of the data
    for line in myfile:
        row = line.decode('utf-8', errors='ignore').replace('\r\n', '').split(',')
        data.append(row)

# save the data as a dataframe
df = pd.DataFrame(data=data, columns = header)

希望这可以帮助人们第一次遇到这个问题。

I am posting an answer to provide an updated solution and explanation as to why this problem can occur. Say you are getting this data from a database or Excel workbook. If you have special characters like La Cañada Flintridge city, well unless you are exporting the data using UTF-8 encoding, you’re going to introduce errors. La Cañada Flintridge city will become La Ca\xf1ada Flintridge city. If you are using pandas.read_csv without any adjustments to the default parameters, you’ll hit the following error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte

Fortunately, there are a few solutions.

Option 1, fix the exporting. Be sure to use UTF-8 encoding.

Option 2, if fixing the exporting problem is not available to you, and you need to use pandas.read_csv, be sure to include the following paramters, engine='python'. By default, pandas uses engine='C' which is great for reading large clean files, but will crash if anything unexpected comes up. In my experience, setting encoding='utf-8' has never fixed this UnicodeDecodeError. Also, you do not need to use errors_bad_lines, however, that is still an option if you REALLY need it.

pd.read_csv(<your file>, engine='python')

Option 3: solution is my preferred solution personally. Read the file using vanilla Python.

import pandas as pd

data = []

with open(<your file>, "rb") as myfile:
    # read the header separately
    # decode it as 'utf-8', remove any special characters, and split it on the comma (or delimiter)
    header = myfile.readline().decode('utf-8').replace('\r\n', '').split(',')
    # read the rest of the data
    for line in myfile:
        row = line.decode('utf-8', errors='ignore').replace('\r\n', '').split(',')
        data.append(row)

# save the data as a dataframe
df = pd.DataFrame(data=data, columns = header)

Hope this helps people encountering this issue for the first time.


回答 8

在这个问题上纠结了一段时间，由于它是搜索的第一个结果，所以想在这里发一下。把 encoding="iso-8859-1" 参数加到 pandas 的 read_csv 上没有用，换成其他任何编码也不行，始终报 UnicodeDecodeError。

如果您要向 pd.read_csv() 传递文件句柄，则需要在 open 文件时指定 encoding，而不是在 read_csv 中指定。事后看来很明显，但追查起来却是个很隐蔽的错误。

Struggled with this a while and thought I’d post on this question as it’s the first search result. Adding the encoding="iso-8859-1" tag to pandas read_csv didn’t work, nor did any other encoding, kept giving a UnicodeDecodeError.

If you’re passing a file handle to pd.read_csv(), you need to put the encoding attribute on the file open, not in read_csv. Obvious in hindsight, but a subtle error to track down.
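
A minimal sketch of what that looks like (the file name and the encoding value are just examples):

import pandas as pd

# the encoding goes on open(); read_csv then receives already-decoded text
with open('file.csv', encoding='iso-8859-1') as fh:
    df = pd.read_csv(fh)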


回答 9

这个答案似乎可以一揽子解决 CSV 编码问题。如果您的表头出现如下奇怪的编码问题：

>>> f = open(filename,"r")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('\ufeffid', '1'), ... ])

然后,您在CSV文件的开头就有一个字节顺序标记(BOM)字符。这个答案解决了这个问题:

Python读取csv-BOM嵌入第一个密钥

解决方案是使用 encoding="utf-8-sig" 加载 CSV：

>>> f = open(filename,"r", encoding="utf-8-sig")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('id', '1'), ... ])

希望这对某人有帮助。

This answer seems to be the catch-all for CSV encoding issues. If you are getting a strange encoding problem with your header like this:

>>> f = open(filename,"r")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('\ufeffid', '1'), ... ])

Then you have a byte order mark (BOM) character at the beginning of your CSV file. This answer addresses the issue:

Python read csv – BOM embedded into the first key

The solution is to load the CSV with encoding="utf-8-sig":

>>> f = open(filename,"r", encoding="utf-8-sig")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('id', '1'), ... ])

Hopefully this helps someone.
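
The same fix also works when reading directly with pandas, since utf-8-sig strips the BOM before the header is parsed (filename is the same placeholder as above):

import pandas as pd

df = pd.read_csv(filename, encoding="utf-8-sig")   # first column comes back as 'id', not '\ufeffid'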


回答 10

我正在发布此旧线程的更新。我找到了一个可行的解决方案,但需要打开每个文件。我在LibreOffice中打开了csv文件,选择另存为>编辑过滤器设置。在下拉菜单中,我选择了UTF8编码。然后我添加encoding="utf-8-sig"data = pd.read_csv(r'C:\fullpathtofile\filename.csv', sep = ',', encoding="utf-8-sig")

希望这对某人有帮助。

I am posting an update to this old thread. I found one solution that worked, but requires opening each file. I opened my csv file in LibreOffice, chose Save As > edit filter settings. In the drop-down menu I chose UTF8 encoding. Then I added encoding="utf-8-sig" to the data = pd.read_csv(r'C:\fullpathtofile\filename.csv', sep = ',', encoding="utf-8-sig").

Hope this helps someone.


回答 11

我无法打开从网上银行下载的简体中文 CSV 文件，我试过 latin1，也试过 iso-8859-1 和 cp1252，但都无济于事。

但是用 pd.read_csv("", encoding='gbk') 就直接解决了。

I have trouble opening a CSV file in simplified Chinese downloaded from an online bank, I have tried latin1, I have tried iso-8859-1, I have tried cp1252, all to no avail.

But pd.read_csv("",encoding ='gbk') simply does the work.


回答 12

请尝试添加

encoding='unicode_escape'

这会有所帮助。为我工作。另外,请确保使用正确的定界符和列名。

您可以从仅加载1000行开始,以快速加载文件。

Please try to add

encoding='unicode_escape'

This will help. Worked for me. Also, make sure you’re using the correct delimiter and column names.

You can start with loading just 1000 rows to load the file quickly.


回答 13

我正在使用Jupyter笔记本。以我为例,它以错误的格式显示文件。“编码”选项无效。因此,我将CSV保存为utf-8格式,并且可以正常工作。

I am using Jupyter-notebook. And in my case, it was showing the file in the wrong format. The ‘encoding’ option was not working. So I save the csv in utf-8 format, and it works.


回答 14

尝试这个:

import pandas as pd
with open('filename.csv') as f:
    data = pd.read_csv(f)

看起来它会处理编码,而无需通过参数明确表示

Try this:

import pandas as pd
with open('filename.csv') as f:
    data = pd.read_csv(f)

Looks like it will take care of the encoding without explicitly expressing it through argument


回答 15

在传递给熊猫之前,请检查编码。它会使您减速,但是…

with open(path, 'r') as f:
    encoding = f.encoding 

df = pd.read_csv(path,sep=sep, encoding=encoding)

在python 3.7中

Check the encoding before you pass to pandas. It will slow you down, but…

with open(path, 'r') as f:
    encoding = f.encoding 

df = pd.read_csv(path,sep=sep, encoding=encoding)

In python 3.7


回答 16

我遇到的另一个导致相同错误的重要问题是:

_values = pd.read_csv("C:\Users\Mujeeb\Desktop\file.xlxs")

^此行导致相同的错误,因为我正在使用read_csv()方法读取Excel文件。使用read_excel()阅读.xlxs

Another important issue that I faced which resulted in the same error was:

_values = pd.read_csv("C:\Users\Mujeeb\Desktop\file.xlxs")

^This line resulted in the same error because I am reading an excel file using read_csv() method. Use read_excel() for reading .xlxs
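
For completeness, a sketch of the corrected call, assuming the file is really an .xlsx workbook; the raw-string prefix also sidesteps the separate problem of backslash escapes in the Windows path (depending on your install, read_excel may need an engine such as openpyxl):

import pandas as pd

# raw string so the backslashes in the Windows path are not treated as escape sequences
_values = pd.read_excel(r"C:\Users\Mujeeb\Desktop\file.xlsx")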


如何避免Python / Pandas在保存的csv中创建索引?

问题:如何避免Python / Pandas在保存的csv中创建索引?

对文件进行一些编辑后,我试图将csv保存到文件夹。

每次我使用 pd.to_csv('C:/Path of file.csv') 保存时，生成的 csv 文件里都会多出一列单独的索引。我想避免把索引写进 csv。

我试过了:

pd.read_csv('C:/Path to file to edit.csv', index_col = False)

并保存文件…

pd.to_csv('C:/Path to save edited file.csv', index_col = False)

但是,我仍然得到不需要的索引列。保存文件时如何避免这种情况?

I am trying to save a csv to a folder after making some edits to the file.

Every time I use pd.to_csv('C:/Path of file.csv') the csv file has a separate column of indexes. I want to avoid printing the index to csv.

I tried:

pd.read_csv('C:/Path to file to edit.csv', index_col = False)

And to save the file…

pd.to_csv('C:/Path to save edited file.csv', index_col = False)

However, I still got the unwanted index column. How can I avoid this when I save my files?


回答 0

使用index=False

df.to_csv('your.csv', index=False)

Use index=False.

df.to_csv('your.csv', index=False)

回答 1

有两种方法可以处理我们不希望将索引存储在csv文件中的情况。

  1. 正如其他人所述,将 数据框保存到csv文件时可以使用index = False

    df.to_csv('file_name.csv',index=False)

  2. 或者，您可以带着索引原样保存数据框，读取时只需删除包含先前索引的 'Unnamed: 0' 列即可！很简单！

    df.to_csv(' file_name.csv ')
    df_new = pd.read_csv('file_name.csv').drop(['Unnamed: 0'], axis=1)

There are two ways to handle the situation where we do not want the index to be stored in csv file.

  1. As others have stated you can use index=False while saving your
    dataframe to csv file.

    df.to_csv('file_name.csv',index=False)

  2. Or you can save your dataframe as it is with an index, and while reading you just drop the Unnamed: 0 column containing your previous index. Simple!

    df.to_csv(' file_name.csv ')
    df_new = pd.read_csv('file_name.csv').drop(['Unnamed: 0'], axis=1)


回答 2

如果不需要索引,请使用以下命令读取文件:

import pandas as pd
df = pd.read_csv('file.csv', index_col=0)

使用保存

df.to_csv('file.csv', index=False)

If you want no index, read file using:

import pandas as pd
df = pd.read_csv('file.csv', index_col=0)

save it using

df.to_csv('file.csv', index=False)

回答 3

正如其他人所说，如果您一开始就不想保存索引列，可以使用 df.to_csv('processed.csv', index=False)

但是，您通常使用的数据本身就带有某种索引，比如一个 “timestamp” 列，所以我会保留该索引，并在加载数据时使用它。

因此,要保存索引数据,请首先设置其索引,然后保存DataFrame:

df = df.set_index('timestamp')
df.to_csv('processed.csv')

之后,您可以读取带有索引的数据:

pd.read_csv('processed.csv', index_col='timestamp')

或读取数据,然后设置索引:

df = pd.read_csv('filename.csv')
df = df.set_index('column_name')

As others have stated, if you don’t want to save the index column in the first place, you can use df.to_csv('processed.csv', index=False)

However, since the data you will usually use, have some sort of index themselves, let’s say a ‘timestamp’ column, I would keep the index and load the data using it.

So, to save the indexed data, first set their index and then save the DataFrame:

df = df.set_index('timestamp')
df.to_csv('processed.csv')

Afterwards, you can either read the data with the index:

pd.read_csv('processed.csv', index_col='timestamp')

or read the data, and then set the index:

df = pd.read_csv('filename.csv')
df = df.set_index('column_name')

回答 4

如果要将此列保留为索引,则可以采用另一种解决方案。

pd.read_csv('filename.csv', index_col='Unnamed: 0')

Another solution if you want to keep this column as index.

pd.read_csv('filename.csv', index_col='Unnamed: 0')

回答 5

如果您想要一个好的格式,那么下一条语句是最好的:

dataframe_prediction.to_csv('filename.csv', sep=',', encoding='utf-8', index=False)

这样您将得到一个以 ',' 作为列分隔符、采用 utf-8 编码的 csv 文件。另外，数字索引不会出现在文件中。

If you want a good format the next statement is the best:

dataframe_prediction.to_csv('filename.csv', sep=',', encoding='utf-8', index=False)

In this case you have got a csv file with ‘,’ as separate between columns and utf-8 format. In addition, numerical index won’t appear.


将多个csv文件导入到pandas中并串联到一个DataFrame中

问题:将多个csv文件导入到pandas中并串联到一个DataFrame中

我想将目录中的多个csv文件读入pandas,并将它们连接成一个大的DataFrame。我还无法弄清楚。这是我到目前为止的内容:

import glob
import pandas as pd

# get data file names
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

我想我在for循环中需要一些帮助吗???

I would like to read several csv files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I have so far:

import glob
import pandas as pd

# get data file names
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

I guess I need some help within the for loop???


回答 0

如果所有 csv 文件中的列都相同，可以尝试下面的代码。我加了 header=0，这样读取 csv 后第一行就会被用作列名。

import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

If you have same columns in all your csv files then you can try the code below. I have added header=0 so that after reading csv first row can be assigned as the column names.

import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

回答 1

darindaCoder 答案的一个替代方案：

path = r'C:\DRO\DCL_rawdata_files'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # advisable to use os.path.join as this makes concatenation OS independent

df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df   = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one

An alternative to darindaCoder’s answer:

path = r'C:\DRO\DCL_rawdata_files'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # advisable to use os.path.join as this makes concatenation OS independent

df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df   = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one

回答 2

import glob, os    
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))
import glob, os    
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))

回答 3

Dask库可以从多个文件读取数据帧:

>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')

（来源：http://dask.pydata.org/en/latest/examples/dataframe-csv.html）

Dask数据框实现了Pandas数据框API的子集。如果所有数据都适合内存,则可以调用df.compute()将数据框转换为Pandas数据框。

The Dask library can read a dataframe from multiple files:

>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')

(Source: http://dask.pydata.org/en/latest/examples/dataframe-csv.html)

The Dask dataframes implement a subset of the Pandas dataframe API. If all the data fits into memory, you can call df.compute() to convert the dataframe into a Pandas dataframe.
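
A compact sketch of that workflow (dask is a separate install; the file pattern is only illustrative):

import dask.dataframe as dd

ddf = dd.read_csv('data*.csv')   # lazily scans every CSV matching the pattern
df = ddf.compute()               # materialize the result as a regular pandas DataFrame (must fit in memory)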


回答 4

这里几乎所有答案要么复杂得没有必要（glob 模式匹配），要么依赖额外的第三方库。其实只用 Pandas 和 python（所有版本）内置的功能，两行代码就能完成。

对于少量文件（一行代码）：

df = pd.concat(map(pd.read_csv, ['data/d1.csv', 'data/d2.csv','data/d3.csv']))

对于许多文件:

from os import listdir

filepaths = [f for f in listdir("./data") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))

这行用来生成 df 的 pandas 代码利用了 3 件事：

  1. Python 的 map(function, iterable) 把可迭代对象（我们的列表，也就是 filepaths 中的每个 csv 路径）逐个传给函数（pd.read_csv()）。
  2. Pandas 的 read_csv() 函数正常读取每个 CSV 文件。
  3. Pandas 的 concat() 把这些结果合并到一个 df 变量中。

Almost all of the answers here are either unnecessarily complex (glob pattern matching) or rely on additional 3rd party libraries. You can do this in 2 lines using everything Pandas and python (all versions) already have built in.

For a few files – 1 liner:

df = pd.concat(map(pd.read_csv, ['data/d1.csv', 'data/d2.csv','data/d3.csv']))

For many files:

from os import listdir

filepaths = [f for f in listdir("./data") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))

This pandas line which sets the df utilizes 3 things:

  1. Python’s map (function, iterable) sends to the function (the pd.read_csv()) the iterable (our list) which is every csv element in filepaths).
  2. Panda’s read_csv() function reads in each CSV file as normal.
  3. Panda’s concat() brings all these under one df variable.

回答 5

编辑：我通过谷歌搜索找到了 https://stackoverflow.com/a/21232849/186078。但最近我发现，先用 numpy 做各种操作，最后再一次性赋给数据框，比迭代地操作数据框本身要快，而且这种思路在本方案中也适用。

我确实希望任何访问此页面的人都考虑采用这种方法,但又不想将这段巨大的代码作为注释并使其可读性降低。

您可以利用numpy真正加快数据帧的连接速度。

import os
import glob
import pandas as pd
import numpy as np

path = "my_dir_full_path"
allFiles = glob.glob(os.path.join(path,"*.csv"))


np_array_list = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    np_array_list.append(df.as_matrix())

comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)

big_frame.columns = ["col1","col2"....]

时间统计:

total files :192
avg lines per file :8492
--approach 1 without numpy -- 8.248656988143921 seconds ---
total records old :1630571
--approach 2 with numpy -- 2.289292573928833 seconds ---

Edit: I googled my way into https://stackoverflow.com/a/21232849/186078. However of late I am finding it faster to do any manipulation using numpy and then assigning it once to dataframe rather than manipulating the dataframe itself on an iterative basis and it seems to work in this solution too.

I do sincerely want anyone hitting this page to consider this approach, but don’t want to attach this huge piece of code as a comment and making it less readable.

You can leverage numpy to really speed up the dataframe concatenation.

import os
import glob
import pandas as pd
import numpy as np

path = "my_dir_full_path"
allFiles = glob.glob(os.path.join(path,"*.csv"))


np_array_list = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    np_array_list.append(df.as_matrix())

comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)

big_frame.columns = ["col1","col2"....]

Timing stats:

total files :192
avg lines per file :8492
--approach 1 without numpy -- 8.248656988143921 seconds ---
total records old :1630571
--approach 2 with numpy -- 2.289292573928833 seconds ---

回答 6

如果要递归搜索Python 3.5或更高版本),则可以执行以下操作:

from glob import iglob
import pandas as pd

path = r'C:\user\your\path\**\*.csv'

all_rec = iglob(path, recursive=True)     
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)

请注意,最后三行可以用一行表示:

df = pd.concat((pd.read_csv(f) for f in iglob(path, recursive=True)), ignore_index=True)

您可以在此处找到 ** 的文档。另外，我用 iglob 代替 glob，因为它返回的是迭代器而不是列表。



编辑:多平台递归函数:

您可以将以上内容包装到一个多平台功能(Linux,Windows,Mac)中,因此可以执行以下操作:

df = read_df_rec(r'C:\user\your\path', r'*.csv')

这是函数:

from glob import iglob
from os.path import join
import pandas as pd

def read_df_rec(path, fn_regex=r'*.csv'):
    return pd.concat((pd.read_csv(f) for f in iglob(
        join(path, '**', fn_regex), recursive=True)), ignore_index=True)

If you want to search recursively (Python 3.5 or above), you can do the following:

from glob import iglob
import pandas as pd

path = r'C:\user\your\path\**\*.csv'

all_rec = iglob(path, recursive=True)     
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)

Note that the three last lines can be expressed in one single line:

df = pd.concat((pd.read_csv(f) for f in iglob(path, recursive=True)), ignore_index=True)

You can find the documentation of ** here. Also, I used iglob instead of glob, as it returns an iterator instead of a list.



EDIT: Multiplatform recursive function:

You can wrap the above into a multiplatform function (Linux, Windows, Mac), so you can do:

df = read_df_rec(r'C:\user\your\path', r'*.csv')

Here is the function:

from glob import iglob
from os.path import join
import pandas as pd

def read_df_rec(path, fn_regex=r'*.csv'):
    return pd.concat((pd.read_csv(f) for f in iglob(
        join(path, '**', fn_regex), recursive=True)), ignore_index=True)

回答 7

方便快捷

导入两个或多个csv而不需要列出名称。

import glob

df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))

Easy and Fast

Import two or more csv‘s without having to make a list of names.

import glob

df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))

回答 8

这是使用 map 的单行写法，但如果您想指定其他参数，可以这样做：

import pandas as pd
import glob
import functools

df = pd.concat(map(functools.partial(pd.read_csv, sep='|', compression=None), 
                    glob.glob("data/*.csv")))

注意:map本身不允许您提供其他参数。

one liner using map, but if you’d like to specify additional args, you could do:

import pandas as pd
import glob
import functools

df = pd.concat(map(functools.partial(pd.read_csv, sep='|', compression=None), 
                    glob.glob("data/*.csv")))

Note: map by itself does not let you supply additional args.


回答 9

如果压缩了多个csv文件,则可以使用zipfile读取全部内容并进行如下连接:

import zipfile
import numpy as np
import pandas as pd

ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')

train=[]

for f in range(0,len(ziptrain.namelist())):
    if (f == 0):
        train = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
    else:
        my_df = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
        train = (pd.DataFrame(np.concatenate((train,my_df),axis=0), 
                          columns=list(my_df.columns.values)))

If the multiple csv files are zipped, you may use zipfile to read all and concatenate as below:

import zipfile
import numpy as np
import pandas as pd

ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')

train=[]

for f in range(0,len(ziptrain.namelist())):
    if (f == 0):
        train = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
    else:
        my_df = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
        train = (pd.DataFrame(np.concatenate((train,my_df),axis=0), 
                          columns=list(my_df.columns.values)))

回答 10

另一个使用列表推导式的单行写法，它允许给 read_csv 传递参数。

df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])

Another on-liner with list comprehension which allows to use arguments with read_csv.

df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])

回答 11

基于 @Sid 的好答案。

串联之前,您可以将csv文件加载到中间字典中,该字典可以根据文件名(格式为dict_of_df['filename.csv'])访问每个数据集。例如,当列名未对齐时,此类词典可帮助您识别异构数据格式的问题。

导入模块并找到文件路径:

import os
import glob
import pandas
from collections import OrderedDict
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

注意:OrderedDict不是必需的,但是它将保留文件顺序,这可能对分析有用。

将csv文件加载到字典中。然后连接:

dict_of_df = OrderedDict((f, pandas.read_csv(f)) for f in filenames)
pandas.concat(dict_of_df, sort=True)

键是文件名f,值是csv文件的数据帧内容。除了f用作字典键之外,还可以使用os.path.basename(f)或其他os.path方法将字典中键的大小减小到仅相关的较小部分。

Based on @Sid’s good answer.

Before concatenating, you can load csv files into an intermediate dictionary which gives access to each data set based on the file name (in the form dict_of_df['filename.csv']). Such a dictionary can help you identify issues with heterogeneous data formats, when column names are not aligned for example.

Import modules and locate file paths:

import os
import glob
import pandas
from collections import OrderedDict
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

Note: OrderedDict is not necessary, but it’ll keep the order of files which might be useful for analysis.

Load csv files into a dictionary. Then concatenate:

dict_of_df = OrderedDict((f, pandas.read_csv(f)) for f in filenames)
pandas.concat(dict_of_df, sort=True)

Keys are file names f and values are the data frame content of csv files. Instead of using f as a dictionary key, you can also use os.path.basename(f) or other os.path methods to reduce the size of the key in the dictionary to only the smaller part that is relevant.
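
For instance, a small variation of the dictionary built above, keyed on the base file name only (same hypothetical path and imports as in the snippet above):

import os
import glob
import pandas
from collections import OrderedDict

filenames = glob.glob(r'C:\DRO\DCL_rawdata_files' + "/*.csv")

# key on 'file.csv' rather than the full path
dict_of_df = OrderedDict((os.path.basename(f), pandas.read_csv(f)) for f in filenames)
pandas.concat(dict_of_df, sort=True)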


回答 12

使用pathlib库的替代方法(通常首选而不是os.path)。

此方法避免了迭代地使用 pandas 的 concat()/append()。

从pandas文档中:
值得注意的是,concat()(因此,append())会完整复制数据,并且不断重用此函数可能会对性能产生重大影响。如果需要对多个数据集使用该操作,请使用列表推导。

import pandas as pd
from pathlib import Path

dir = Path("../relevant_directory")

df = (pd.read_csv(f) for f in dir.glob("*.csv"))
df = pd.concat(df)

Alternative using the pathlib library (often preferred over os.path).

This method avoids iterative use of pandas concat()/append().

From the pandas documentation:
It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.

import pandas as pd
from pathlib import Path

dir = Path("../relevant_directory")

df = (pd.read_csv(f) for f in dir.glob("*.csv"))
df = pd.concat(df)

回答 13

这是在Google云端硬盘上使用Colab的方式

import pandas as pd
import glob

path = r'/content/drive/My Drive/data/actual/comments_only' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True,sort=True)
frame.to_csv('/content/drive/onefile.csv')

This is how you can do using Colab on Google Drive

import pandas as pd
import glob

path = r'/content/drive/My Drive/data/actual/comments_only' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True,sort=True)
frame.to_csv('/content/drive/onefile.csv')

回答 14

import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files' # use your path
file_path_list = glob.glob(path + "/*.csv")

file_iter = iter(file_path_list)

list_df_csv = []
list_df_csv.append(pd.read_csv(next(file_iter)))

for file in file_iter:
    list_df_csv.append(pd.read_csv(file, header=0))
df = pd.concat(list_df_csv, ignore_index=True)
import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files' # use your path
file_path_list = glob.glob(path + "/*.csv")

file_iter = iter(file_path_list)

list_df_csv = []
list_df_csv.append(pd.read_csv(next(file_iter)))

for file in file_iter:
    list_df_csv.append(pd.read_csv(file, header=0))
df = pd.concat(list_df_csv, ignore_index=True)

通过整数索引选择一行熊猫系列/数据框

问题:通过整数索引选择一行熊猫系列/数据框

我很好奇，为什么不支持 df[2]，而 df.ix[2] 和 df[2:3] 都可以工作。

In [26]: df.ix[2]
Out[26]: 
A    1.027680
B    1.514210
C   -1.466963
D   -0.162339
Name: 2000-01-03 00:00:00

In [27]: df[2:3]
Out[27]: 
                  A        B         C         D
2000-01-03  1.02768  1.51421 -1.466963 -0.162339

我本来期望 df[2] 的行为与 df[2:3] 一致，以符合 Python 的索引约定。不支持用单个整数索引行，是否有设计上的原因？

I am curious as to why df[2] is not supported, while df.ix[2] and df[2:3] both work.

In [26]: df.ix[2]
Out[26]: 
A    1.027680
B    1.514210
C   -1.466963
D   -0.162339
Name: 2000-01-03 00:00:00

In [27]: df[2:3]
Out[27]: 
                  A        B         C         D
2000-01-03  1.02768  1.51421 -1.466963 -0.162339

I would expect df[2] to work the same way as df[2:3] to be consistent with Python indexing convention. Is there a design reason for not supporting indexing row by single integer?


回答 0

呼应 @HYRY 的回答，请参阅 0.11 中的新文档

http://pandas.pydata.org/pandas-docs/stable/indexing.html

在这里我们有了新的运算符：.iloc 明确只支持整数索引，.loc 明确只支持标签索引

例如,想象这种情况

In [1]: df = pd.DataFrame(np.random.rand(5,2),index=range(0,10,2),columns=list('AB'))

In [2]: df
Out[2]: 
          A         B
0  1.068932 -0.794307
2 -0.470056  1.192211
4 -0.284561  0.756029
6  1.037563 -0.267820
8 -0.538478 -0.800654

In [5]: df.iloc[[2]]
Out[5]: 
          A         B
4 -0.284561  0.756029

In [6]: df.loc[[2]]
Out[6]: 
          A         B
2 -0.470056  1.192211

[] 仅对行进行切片(按标签位置)

echoing @HYRY, see the new docs in 0.11

http://pandas.pydata.org/pandas-docs/stable/indexing.html

Here we have new operators, .iloc to explicity support only integer indexing, and .loc to explicity support only label indexing

e.g. imagine this scenario

In [1]: df = pd.DataFrame(np.random.rand(5,2),index=range(0,10,2),columns=list('AB'))

In [2]: df
Out[2]: 
          A         B
0  1.068932 -0.794307
2 -0.470056  1.192211
4 -0.284561  0.756029
6  1.037563 -0.267820
8 -0.538478 -0.800654

In [5]: df.iloc[[2]]
Out[5]: 
          A         B
4 -0.284561  0.756029

In [6]: df.loc[[2]]
Out[6]: 
          A         B
2 -0.470056  1.192211

[] slices the rows (by label location) only


回答 1

DataFrame 索引运算符 [] 的主要目的是选择列。

当索引运算符传递字符串或整数时,它将尝试查找具有该特定名称的列并将其作为Series返回。

因此，在上述问题中：df[2] 会查找名称与整数值 2 匹配的列。该列不存在，于是引发 KeyError。


使用切片符号时,DataFrame索引运算符完全更改行为以选择行

奇怪的是,当给定切片时,DataFrame索引运算符选择行,并且可以按整数位置或按索引标签来选择行。

df[2:3]

这会从整数位置 2 的行开始切片到 3（不含末端），因此只得到一行。下面的代码则从整数位置 6 开始，每隔三行取一行，直到 20（不包括 20）。

df[6:20:3]

如果DataFrame索引中包含字符串,则还可以使用由字符串标签组成的切片。有关更多详细信息,请参见.iloc与.loc上的此解决方案

我几乎从未将这种切片符号与索引运算符一起使用,因为它不是显式的,而且几乎从未使用过。按行切片时,请坚持使用.loc/.iloc

The primary purpose of the DataFrame indexing operator, [] is to select columns.

When the indexing operator is passed a string or integer, it attempts to find a column with that particular name and return it as a Series.

So, in the question above: df[2] searches for a column name matching the integer value 2. This column does not exist and a KeyError is raised.


The DataFrame indexing operator completely changes behavior to select rows when slice notation is used

Strangely, when given a slice, the DataFrame indexing operator selects rows and can do so by integer location or by index label.

df[2:3]

This will slice beginning from the row with integer location 2 up to 3, exclusive of the last element. So, just a single row. The following selects rows beginning at integer location 6 up to but not including 20 by every third row.

df[6:20:3]

You can also use slices consisting of string labels if your DataFrame index has strings in it. For more details, see this solution on .iloc vs .loc.

I almost never use this slice notation with the indexing operator as its not explicit and hardly ever used. When slicing by rows, stick with .loc/.iloc.
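
Here is a small sketch of the label-based slice mentioned above, using a toy frame with a string index (the frame itself is hypothetical):

import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

df[1:3]       # integer-location slice: rows 'b' and 'c'
df['b':'d']   # label slice: rows 'b' through 'd', endpoints inclusive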


回答 2

您可以把 DataFrame 看作由 Series 组成的字典。df[key] 会尝试按 key 选取列索引，并返回一个 Series 对象。

但是,在[]内切片会对行进行切片,因为这是非常常见的操作。

您可以阅读文档以了解详细信息:

http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics

You can think DataFrame as a dict of Series. df[key] try to select the column index by key and returns a Series object.

However slicing inside of [] slices the rows, because it’s a very common operation.

You can read the document for detail:

http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
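
In other words, with a tiny made-up frame: a scalar key selects a column as a Series, while a slice selects rows:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

df['A']     # scalar key: dict-style column access, returns a Series
df[0:2]     # slice: selects the first two rows as a DataFrame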


回答 3

要基于索引访问熊猫表,还可以考虑使用numpy.as_array选项将表转换为Numpy数组,方法如下:

np_df = df.as_matrix()

然后

np_df[i] 

会工作。

To index-based access to the pandas table, one can also consider numpy.as_array option to convert the table to Numpy array as

np_df = df.as_matrix()

and then

np_df[i] 

would work.


回答 4

您可以看一下源代码

DataFrame 有一个私有函数 _slice()，用来对 DataFrame 进行切片，它通过参数 axis 决定在哪个轴上切片。DataFrame 的 __getitem__() 在调用 _slice() 时没有设置该轴，因此 _slice() 默认沿轴 0 切片。

您可以进行一个简单的实验,这可能对您有所帮助:

print df._slice(slice(0, 2))
print df._slice(slice(0, 2), 0)
print df._slice(slice(0, 2), 1)

You can take a look at the source code .

DataFrame has a private function _slice() to slice the DataFrame, and it allows the parameter axis to determine which axis to slice. The __getitem__() for DataFrame doesn’t set the axis while invoking _slice(). So the _slice() slice it by default axis 0.

You can take a simple experiment, that might help you:

print df._slice(slice(0, 2))
print df._slice(slice(0, 2), 0)
print df._slice(slice(0, 2), 1)

回答 5

您可以像这样遍历数据帧。

for ad in range(1,dataframe_c.size):
    print(dataframe_c.values[ad])

you can loop through the data frame like this .

for ad in range(1,dataframe_c.size):
    print(dataframe_c.values[ad])

使用熊猫的“大数据”工作流程

问题:使用熊猫的“大数据”工作流程

在学习 pandas 的过程中，我花了好几个月试图弄清这个问题的答案。我在日常工作中使用 SAS，它非常好用，因为它提供了核外（out-of-core）支持。但是，出于许多其他原因，SAS 作为一款软件也很糟糕。

我希望有一天能用 python 和 pandas 取代 SAS，但我目前还缺少针对大型数据集的核外工作流程。我说的不是需要分布式网络的“大数据”，而是那些大到无法放进内存、却又小到可以放在硬盘上的文件。

我的第一个想法是使用 HDFStore 把大型数据集保存在磁盘上，然后只把需要的部分拉进数据帧中进行分析。也有人提到 MongoDB 是更易用的替代方案。我的问题是：

什么是实现以下目标的最佳实践工作流:

  1. 将平面文件加载到永久的磁盘数据库结构中
  2. 查询该数据库以检索要输入到熊猫数据结构中的数据
  3. 处理熊猫中的片段后更新数据库

现实世界中的示例将不胜感激,尤其是那些从“大数据”中使用熊猫的人。

编辑-我希望如何工作的示例:

  1. 迭代地导入一个大的平面文件,并将其存储在永久的磁盘数据库结构中。这些文件通常太大而无法容纳在内存中。
  2. 为了使用Pandas,我想读取这些数据的子集(通常一次只读取几列),使其适合内存。
  3. 我将通过对所选列执行各种操作来创建新列。
  4. 然后,我将不得不将这些新列添加到数据库结构中。

我正在尝试找到执行这些步骤的最佳实践方法。阅读有关熊猫和pytables的链接,似乎添加一个新列可能是个问题。

编辑-专门回答杰夫的问题:

  1. 我正在建立消费者信用风险模型。数据类型包括电话,SSN和地址特征;财产价值;犯罪记录,破产等贬义信息。我每天使用的数据集平均有近1,000到2,000个字段,这些字段是混合数据类型:数字和字符数据的连续,名义和有序变量。我很少追加行,但是我确实执行许多创建新列的操作。
  2. 典型的操作涉及使用条件逻辑将几个列合并到一个新的复合列中。例如,if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'。这些操作的结果是数据集中每个记录的新列。
  3. 最后,我想将这些新列添加到磁盘数据结构中。我将重复步骤2,使用交叉表和描述性统计数据探索数据,以寻找有趣的直观关系进行建模。
  4. 一个典型的项目文件通常约为1GB。文件组织成这样的方式,其中一行包含消费者数据记录。每条记录的每一行都有相同的列数。情况总是如此。
  5. 创建新列时,我会按行进行子集化是非常罕见的。但是,在创建报告或生成描述性统计信息时,对行进行子集化是很常见的。例如,我可能想为特定业务创建一个简单的频率,例如零售信用卡。为此,除了我要报告的任何列之外,我将只选择那些业务线=零售的记录。但是,在创建新列时,我将拉出所有数据行,而仅提取操作所需的列。
  6. 建模过程要求我分析每一列,寻找与某些结果变量有关的有趣关系,并创建描述这些关系的新复合列。我探索的列通常以小集合形式完成。例如,我将集中讨论一组20个仅涉及属性值的列,并观察它们与贷款违约的关系。一旦探索了这些列并创建了新的列,我便转到另一组列,例如大学学历,并重复该过程。我正在做的是创建候选变量,这些变量解释我的数据和某些结果之间的关系。在此过程的最后,我应用了一些学习技术,这些技术可以根据这些复合列创建方程。

我很少向数据集添加行。我几乎总是会创建新列(统计/机器学习术语中的变量或功能)。

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for it’s out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.

One day I hope to replace my use of SAS with python and pandas, but I currently lack an out-of-core workflow for large datasets. I’m not talking about “big data” that requires a distributed network, but rather files too large to fit in memory but small enough to fit on a hard-drive.

My first thought is to use HDFStore to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier to use alternative. My question is this:

What are some best-practice workflows for accomplishing the following:

  1. Loading flat files into a permanent, on-disk database structure
  2. Querying that database to retrieve data to feed into a pandas data structure
  3. Updating the database after manipulating pieces in pandas

Real-world examples would be much appreciated, especially from anyone who uses pandas on “large data”.

Edit — an example of how I would like this to work:

  1. Iteratively import a large flat-file and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.
  2. In order to use Pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.
  3. I would create new columns by performing various operations on the selected columns.
  4. I would then have to append these new columns into the database structure.

I am trying to find a best-practice way of performing these steps. Reading links about pandas and pytables it seems that appending a new column could be a problem.

Edit — Responding to Jeff’s questions specifically:

  1. I am building consumer credit risk models. The kinds of data include phone, SSN and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc… The datasets I use every day have nearly 1,000 to 2,000 fields on average of mixed data types: continuous, nominal and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.
  2. Typical operations involve combining several columns using conditional logic into a new, compound column. For example, if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'. The result of these operations is a new column for every record in my dataset.
  3. Finally, I would like to append these new columns into the on-disk data structure. I would repeat step 2, exploring the data with crosstabs and descriptive statistics trying to find interesting, intuitive relationships to model.
  4. A typical project file is usually about 1GB. Files are organized into such a manner where a row consists of a record of consumer data. Each row has the same number of columns for every record. This will always be the case.
  5. It’s pretty rare that I would subset by rows when creating a new column. However, it’s pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say Retail credit cards. To do this, I would select only those records where the line of business = retail in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations.
  6. The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. The columns that I explore are usually done in small sets. For example, I will focus on a set of say 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I’m doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.

It is rare that I would ever add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).


回答 0

我通常以这种方式使用数十GB的数据,例如,我在磁盘上有一些表,这些表是通过查询读取,创建数据并追加回去的。

值得阅读文档以及该线程的后期内容,以获取有关如何存储数据的一些建议。

将影响您存储数据方式的详细信息,例如:
尽可能多地提供详细信息;我可以帮助您建立结构。

  1. 数据大小,行数,列数,列类型;您要追加行还是仅追加列?
  2. 典型的操作将是什么样的。例如,对列进行查询以选择一堆行和特定的列,然后执行一个操作(在内存中),创建新列并保存。
    (提供一个玩具示例可以使我们提供更具体的建议。)
  3. 处理完之后,您该怎么办?步骤2是临时的还是可重复的?
  4. 输入平面文件:大约总大小(以Gb为单位)。这些是如何组织的,例如通过记录?每个文件都包含不同的字段,还是每个文件都有一些记录,每个文件中都有所有字段?
  5. 您是否曾经根据条件选择行(记录)的子集(例如,选择字段A> 5的行)?然后执行某项操作,还是只选择包含所有记录的A,B,C字段(然后执行某项操作)?
  6. 您是否“工作”所有列(成组),还是只用于报告的比例很高(例如,您想保留数据,但无需明确地拉入该列,直到最终结果时间)?

确保安装的 pandas 版本不低于 0.10.1。

逐块迭代读取文件，并进行多表查询。

由于 pytables 针对按行操作（也就是您的查询方式）进行了优化，我们会为每一组字段创建一个表。这样就很容易选出一小组字段（用一个大表也可以，但这样做效率更高……我想将来也许能解除这个限制……而且这样也更直观）：
（以下是伪代码。）

import numpy as np
import pandas as pd

# create a store
store = pd.HDFStore('mystore.h5')

# this is the key to your storage:
#    this maps your fields to a specific group, and defines 
#    what you want to have as data_columns.
#    you might want to create a nice class wrapping this
#    (as you will want to have this map and its inversion)  
group_map = dict(
    A = dict(fields = ['field_1','field_2',.....], dc = ['field_1',....,'field_5']),
    B = dict(fields = ['field_10',......        ], dc = ['field_10']),
    .....
    REPORTING_ONLY = dict(fields = ['field_1000','field_1001',...], dc = []),

)

group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([ (f,g) for f in v['fields'] ]))

读入文件并创建存储（本质上就是 append_to_multiple 所做的事情）：

for f in files:
   # read in the file, additional options may be necessary here
   # the chunksize is not strictly necessary, you may be able to slurp each 
   # file into memory in which case just eliminate this part of the loop 
   # (you can also change chunksize if necessary)
   for chunk in pd.read_table(f, chunksize=50000):
       # we are going to append to each table by group
       # we are not going to create indexes at this time
       # but we *ARE* going to create (some) data_columns

       # figure out the field groupings
       for g, v in group_map.items():
             # create the frame for this group
             frame = chunk.reindex(columns = v['fields'], copy = False)    

             # append it
             store.append(g, frame, index=False, data_columns = v['dc'])

现在,您已将所有表存储在文件中(实际上,您可以根据需要将它们存储在单独的文件中,您可能需要将文件名添加到group_map中,但这可能不是必需的)。

这是获取列并创建新列的方式:

frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
#     select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows

# do calculations on this frame
new_frame = cool_function_on_frame(frame)

# to 'add columns', create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones
# (e.g. so you don't overlap from the other tables)
# add this info to the group_map
store.append(new_group, new_frame.reindex(columns = new_columns_created, copy = False), data_columns = new_columns_created)

准备进行后期处理时:

# This may be a bit tricky; and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple([groups_1,groups_2,.....], where =['field_1>0', 'field_1000=foo'], selector = group_1)

关于data_columns,实际上不需要定义任何 data_columns。它们使您可以根据列来子选择行。例如:

store.select(group, where = ['field_1000=foo', 'field_1001>0'])

在最后的报告生成阶段,它们可能对您来说最有趣(实际上,数据列与其他列是分开的,如果定义太多,这可能会影响效率)。

您可能还想:

  • 创建一个使用字段列表的函数,在groups_map中查找组,然后选择它们并连接结果,以便获得结果框架(本质上就是select_as_multiple所做的事情)。这样,结构对您将非常透明。
  • 在某些数据列上建立索引(使行子设置快得多)。
  • 启用压缩。

如有疑问,请告诉我!

I routinely use tens of gigabytes of data in just this fashion e.g. I have tables on disk that I read via queries, create data and append back.

It’s worth reading the docs and late in this thread for several suggestions for how to store your data.

Details which will affect how you store your data, like:
Give as much detail as you can; and I can help you develop a structure.

  1. Size of data, # of rows, columns, types of columns; are you appending rows, or just columns?
  2. What will typical operations look like. E.g. do a query on columns to select a bunch of rows and specific columns, then do an operation (in-memory), create new columns, save these.
    (Giving a toy example could enable us to offer more specific recommendations.)
  3. After that processing, then what do you do? Is step 2 ad hoc, or repeatable?
  4. Input flat files: how many, rough total size in Gb. How are these organized e.g. by records? Does each one contains different fields, or do they have some records per file with all of the fields in each file?
  5. Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5)? and then do something, or do you just select fields A, B, C with all of the records (and then do something)?
  6. Do you ‘work on’ all of your columns (in groups), or are there a good proportion that you may only use for reports (e.g. you want to keep the data around, but don’t need to pull in that column explicity until final results time)?

Solution

Ensure you have pandas at least 0.10.1 installed.

Read iterating files chunk-by-chunk and multiple table queries.

Since pytables is optimized to operate on row-wise (which is what you query on), we will create a table for each group of fields. This way it’s easy to select a small group of fields (which will work with a big table, but it’s more efficient to do it this way… I think I may be able to fix this limitation in the future… this is more intuitive anyhow):
(The following is pseudocode.)

import numpy as np
import pandas as pd

# create a store
store = pd.HDFStore('mystore.h5')

# this is the key to your storage:
#    this maps your fields to a specific group, and defines 
#    what you want to have as data_columns.
#    you might want to create a nice class wrapping this
#    (as you will want to have this map and its inversion)  
group_map = dict(
    A = dict(fields = ['field_1','field_2',.....], dc = ['field_1',....,'field_5']),
    B = dict(fields = ['field_10',......        ], dc = ['field_10']),
    .....
    REPORTING_ONLY = dict(fields = ['field_1000','field_1001',...], dc = []),

)

group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([ (f,g) for f in v['fields'] ]))

Reading in the files and creating the storage (essentially doing what append_to_multiple does):

for f in files:
   # read in the file, additional options may be necessary here
   # the chunksize is not strictly necessary, you may be able to slurp each 
   # file into memory in which case just eliminate this part of the loop 
   # (you can also change chunksize if necessary)
   for chunk in pd.read_table(f, chunksize=50000):
       # we are going to append to each table by group
       # we are not going to create indexes at this time
       # but we *ARE* going to create (some) data_columns

       # figure out the field groupings
       for g, v in group_map.items():
             # create the frame for this group
             frame = chunk.reindex(columns = v['fields'], copy = False)    

             # append it
             store.append(g, frame, index=False, data_columns = v['dc'])

Now you have all of the tables in the file (actually you could store them in separate files if you wish; you would probably have to add the filename to the group_map, but that probably isn’t necessary).

This is how you get columns and create new ones:

frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
#     select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows

# do calculations on this frame
new_frame = cool_function_on_frame(frame)

# to 'add columns', create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones
# (e.g. so you don't overlap from the other tables)
# add this info to the group_map
store.append(new_group, new_frame.reindex(columns = new_columns_created, copy = False), data_columns = new_columns_created)

When you are ready for post_processing:

# This may be a bit tricky; and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple([groups_1,groups_2,.....], where =['field_1>0', 'field_1000=foo'], selector = group_1)

About data_columns, you don’t actually need to define ANY data_columns; they allow you to sub-select rows based on the column. E.g. something like:

store.select(group, where = ['field_1000=foo', 'field_1001>0'])

They may be most interesting to you in the final report generation stage (essentially a data column is segregated from other columns, which might impact efficiency somewhat if you define a lot).

You also might want to:

  • create a function which takes a list of fields, looks up the groups in the groups_map, then selects these and concatenates the results so you get the resulting frame (this is essentially what select_as_multiple does). This way the structure would be pretty transparent to you.
  • create indexes on certain data columns (this makes row-subsetting much faster).
  • enable compression.
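
For the last two bullets, here is a minimal sketch (the group name 'A' and the column 'field_1' reuse the hypothetical names from the group_map above, and blosc is just one common compression choice):

# open the store with compression enabled (applies to newly written data)
store = pd.HDFStore('mystore.h5', complevel=9, complib='blosc')

# ... append all chunks as above, with index=False ...

# once loading is finished, build an index on the data columns you query on
store.create_table_index('A', columns=['field_1'], optlevel=9, kind='full')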

Let me know when you have questions!


回答 1

我认为以上答案都缺少一种我发现非常有用的简单方法。

当我的文件太大而无法加载到内存中时,我将该文件分成多个较小的文件(按行或列)

示例:对于 30 天、总计约 30GB 的交易数据,我会按天拆分成每个约 1GB 的文件。随后,我分别处理每个文件,并在最后汇总结果。

最大的优势之一是它允许并行处理文件(多个线程或多个进程)

另一个优点是文件操作(如示例中的添加/删除日期)可以通过常规的shell命令完成,而在更高级/更复杂的文件格式中则无法实现

这种方法无法涵盖所有情况,但在许多情况下非常有用

I think the answers above are missing a simple approach that I’ve found very useful.

When I have a file that is too large to load in memory, I break up the file into multiple smaller files (either by row or cols)

Example: In case of 30 days worth of trading data of ~30GB size, I break it into a file per day of ~1GB size. I subsequently process each file separately and aggregate results at the end

One of the biggest advantages is that it allows parallel processing of the files (either multiple threads or processes)

The other advantage is that file manipulation (like adding/removing dates in the example) can be accomplished by regular shell commands, which would not be possible with more advanced/complicated file formats

This approach doesn’t cover all scenarios, but is very useful in a lot of them
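
A minimal sketch of that workflow, using only pandas and the standard library (the per-day file names and the 'symbol'/'volume'/'timestamp' schema are hypothetical):

import glob
from multiprocessing import Pool

import pandas as pd

def process_one_day(path):
    # each worker only ever holds one ~1GB file in memory
    df = pd.read_csv(path, parse_dates=["timestamp"])
    # hypothetical per-file computation
    return df.groupby("symbol")["volume"].sum()

if __name__ == "__main__":
    files = sorted(glob.glob("trades_2015-*.csv"))   # one file per trading day
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_one_day, files)
    # aggregate the per-day results at the end
    total = pd.concat(partial_results).groupby(level=0).sum()
    print(total.head())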


回答 2

问题提出两年后,现在出现了一个“核心外”的 pandas 等价物:dask。它非常棒!尽管它不支持 pandas 的全部功能,但用它已经可以走得很远。

There is now, two years after the question, an ‘out-of-core’ pandas equivalent: dask. It is excellent! Though it does not support all of pandas functionality, you can get really far with it.


回答 3

如果您的数据集介于1到20GB之间,则应该获得具有48GB RAM的工作站。然后,熊猫可以将整个数据集保存在RAM中。我知道这不是您在这里寻找的答案,但是在具有4GB RAM的笔记本电脑上进行科学计算是不合理的。

If your datasets are between 1 and 20GB, you should get a workstation with 48GB of RAM. Then Pandas can hold the entire dataset in RAM. I know it’s not the answer you’re looking for here, but doing scientific computing on a notebook with 4GB of RAM isn’t reasonable.


回答 4

我知道这是一个旧线程,但是我认为Blaze库值得一试。它是针对此类情况而构建的。

从文档:

Blaze将NumPy和Pandas的可用性扩展到分布式和核外计算。Blaze提供了类似于NumPy ND-Array或Pandas DataFrame的接口,但是将这些熟悉的接口映射到了其他各种计算引擎上,例如Postgres或Spark。

编辑:顺便说一下,它由ContinuumIO和NumPy的作者Travis Oliphant支持。

I know this is an old thread but I think the Blaze library is worth checking out. It’s built for these types of situations.

From the docs:

Blaze extends the usability of NumPy and Pandas to distributed and out-of-core computing. Blaze provides an interface similar to that of the NumPy ND-Array or Pandas DataFrame but maps these familiar interfaces onto a variety of other computational engines like Postgres or Spark.

Edit: By the way, it’s supported by ContinuumIO and Travis Oliphant, author of NumPy.


回答 5

pymongo 就适用于这种情况。我也用 python 中的 sql server、sqlite、HDF、ORM(SQLAlchemy)做过原型。首先,pymongo 是基于文档的数据库,因此每个人就是一个文档(带属性的 dict)。许多人组成一个集合,而您可以有很多个集合(人、股票市场、收入)。

pd.DataFrame -> pymongo 注意:我在 read_csv 中使用 chunksize,把每批控制在 5 千到 1 万条记录(再大的话 pymongo 会断开套接字连接)

aCollection.insert((a[1].to_dict() for a in df.iterrows()))

查询:gt =大于…

pd.DataFrame(list(mongoCollection.find({'anAttribute':{'$gt':2887000, '$lt':2889000}})))

.find() 返回一个迭代器,因此我通常用 ichunked 把它切成更小的迭代器。

由于我通常要把 10 个数据源拼接在一起,那么联接(join)怎么做:

aJoinDF = pandas.DataFrame(list(mongoCollection.find({'anAttribute':{'$in':Att_Keys}})))

然后(就我而言,有时我必须先对 aJoinDF 做聚合,之后它才“可合并”)。

df = pandas.merge(df, aJoinDF, on=aKey, how='left')

然后,您可以通过下面的 update 方法把新信息写回主集合。(逻辑集合 vs 物理数据源)。

collection.update({primarykey:foo},{key:change})

对于较小的查找,直接反规范化即可。例如,文档里存有 code,您只需加上 code 对应的文本字段,在创建文档时做一次 dict 查找。

现在,您有了一个以人为中心的漂亮数据集,可以针对每种情况运用自己的逻辑并生成更多属性。最后,您可以把最多能装进内存的 3 个关键指标读入 pandas,进行透视/聚合/数据探索。对我来说,这种方式能处理 300 万条包含数字/大段文本/类别/代码/浮点数/…的记录。

您还可以使用MongoDB内置的两种方法(MapReduce和聚合框架)。有关聚合框架的更多信息,请参见此处,因为它似乎比MapReduce容易,并且看起来便于进行快速聚合工作。注意,我不需要定义字段或关系,可以将项目添加到文档中。在快速变化的numpy,pandas,python工具集的当前状态下,MongoDB可以帮助我开始工作:)

This is the case for pymongo. I have also prototyped using sql server, sqlite, HDF, ORM (SQLAlchemy) in python. First and foremost pymongo is a document based DB, so each person would be a document (dict of attributes). Many people form a collection and you can have many collections (people, stock market, income).

pd.DataFrame -> pymongo Note: I use the chunksize in read_csv to keep it to 5 to 10k records (pymongo drops the socket if larger)

aCollection.insert((a[1].to_dict() for a in df.iterrows()))

querying: gt = greater than…

pd.DataFrame(list(mongoCollection.find({'anAttribute':{'$gt':2887000, '$lt':2889000}})))

.find() returns an iterator so I commonly use ichunked to chop into smaller iterators.

How about a join since I normally get 10 data sources to paste together:

aJoinDF = pandas.DataFrame(list(mongoCollection.find({'anAttribute':{'$in':Att_Keys}})))

then (in my case sometimes I have to agg on aJoinDF first before it’s “mergeable”).

df = pandas.merge(df, aJoinDF, on=aKey, how='left')

And you can then write the new info to your main collection via the update method below. (logical collection vs physical datasources).

collection.update({primarykey:foo},{key:change})

On smaller lookups, just denormalize. For example, you have code in the document and you just add the field code text and do a dict lookup as you create documents.

Now you have a nice dataset based around a person, you can unleash your logic on each case and make more attributes. Finally you can read into pandas your 3 to memory max key indicators and do pivots/agg/data exploration. This works for me for 3 million records with numbers/big text/categories/codes/floats/…

You can also use the two methods built into MongoDB (MapReduce and aggregate framework). See here for more info about the aggregate framework, as it seems to be easier than MapReduce and looks handy for quick aggregate work. Notice I didn’t need to define my fields or relations, and I can add items to a document. At the current state of the rapidly changing numpy, pandas, python toolset, MongoDB helps me just get to work :)


回答 6

我发现这个问题有点晚了,但我处理的是类似的问题(抵押贷款提前还款模型)。我的解决方案是跳过 pandas 的 HDFStore 层,直接使用 pytables。在最终文件中,我把每一列保存为一个单独的 HDF5 数组。

我的基本工作流程是首先从数据库中获取CSV文件。我用gzip压缩,所以它没有那么大。然后,通过在python中对其进行迭代,将每一行转换为实际数据类型并将其写入HDF5文件,将其转换为面向行的HDF5文件。这花费了数十分钟,但是它不使用任何内存,因为它只是逐行地操作。然后,我将面向行的HDF5文件“转置”为面向列的HDF5文件。

表转置如下:

def transpose_table(h_in, table_path, h_out, group_name="data", group_path="/"):
    # Get a reference to the input data.
    tb = h_in.getNode(table_path)
    # Create the output group to hold the columns.
    grp = h_out.createGroup(group_path, group_name, filters=tables.Filters(complevel=1))
    for col_name in tb.colnames:
        logger.debug("Processing %s", col_name)
        # Get the data.
        col_data = tb.col(col_name)
        # Create the output array.
        arr = h_out.createCArray(grp,
                                 col_name,
                                 tables.Atom.from_dtype(col_data.dtype),
                                 col_data.shape)
        # Store the data.
        arr[:] = col_data
    h_out.flush()

然后读回它就像:

def read_hdf5(hdf5_path, group_path="/data", columns=None):
    """Read a transposed data set from a HDF5 file."""
    if isinstance(hdf5_path, tables.file.File):
        hf = hdf5_path
    else:
        hf = tables.openFile(hdf5_path)

    grp = hf.getNode(group_path)
    if columns is None:
        data = [(child.name, child[:]) for child in grp]
    else:
        data = [(child.name, child[:]) for child in grp if child.name in columns]

    # Convert any float32 columns to float64 for processing.
    for i in range(len(data)):
        name, vec = data[i]
        if vec.dtype == np.float32:
            data[i] = (name, vec.astype(np.float64))

    if not isinstance(hdf5_path, tables.file.File):
        hf.close()
    return pd.DataFrame.from_items(data)

现在,我通常在具有大量内存的计算机上运行此程序,因此我可能对内存使用情况不够谨慎。例如,默认情况下,装入操作将读取整个数据集。

这通常对我有用,但是有点笨拙,我不能使用花式的pytables魔术。

编辑:与 pytables 默认的记录数组(array-of-records)相比,这种方法的真正优势在于,我可以用 h5r 把数据加载到 R 中,而 h5r 处理不了表。或者说,至少我没能让它加载异构表。

I spotted this a little late, but I work with a similar problem (mortgage prepayment models). My solution has been to skip the pandas HDFStore layer and use straight pytables. I save each column as an individual HDF5 array in my final file.

My basic workflow is to first get a CSV file from the database. I gzip it, so it’s not as huge. Then I convert that to a row-oriented HDF5 file, by iterating over it in python, converting each row to a real data type, and writing it to a HDF5 file. That takes some tens of minutes, but it doesn’t use any memory, since it’s only operating row-by-row. Then I “transpose” the row-oriented HDF5 file into a column-oriented HDF5 file.

The table transpose looks like:

def transpose_table(h_in, table_path, h_out, group_name="data", group_path="/"):
    # Get a reference to the input data.
    tb = h_in.getNode(table_path)
    # Create the output group to hold the columns.
    grp = h_out.createGroup(group_path, group_name, filters=tables.Filters(complevel=1))
    for col_name in tb.colnames:
        logger.debug("Processing %s", col_name)
        # Get the data.
        col_data = tb.col(col_name)
        # Create the output array.
        arr = h_out.createCArray(grp,
                                 col_name,
                                 tables.Atom.from_dtype(col_data.dtype),
                                 col_data.shape)
        # Store the data.
        arr[:] = col_data
    h_out.flush()

Reading it back in then looks like:

def read_hdf5(hdf5_path, group_path="/data", columns=None):
    """Read a transposed data set from a HDF5 file."""
    if isinstance(hdf5_path, tables.file.File):
        hf = hdf5_path
    else:
        hf = tables.openFile(hdf5_path)

    grp = hf.getNode(group_path)
    if columns is None:
        data = [(child.name, child[:]) for child in grp]
    else:
        data = [(child.name, child[:]) for child in grp if child.name in columns]

    # Convert any float32 columns to float64 for processing.
    for i in range(len(data)):
        name, vec = data[i]
        if vec.dtype == np.float32:
            data[i] = (name, vec.astype(np.float64))

    if not isinstance(hdf5_path, tables.file.File):
        hf.close()
    return pd.DataFrame.from_items(data)

Now, I generally run this on a machine with a ton of memory, so I may not be careful enough with my memory usage. For example, by default the load operation reads the whole data set.

This generally works for me, but it’s a bit clunky, and I can’t use the fancy pytables magic.

Edit: The real advantage of this approach, over the array-of-records pytables default, is that I can then load the data into R using h5r, which can’t handle tables. Or, at least, I’ve been unable to get it to load heterogeneous tables.


回答 7

我发现对大型数据用例有用的一个技巧是通过将浮点精度降低到32位来减少数据量。它并非在所有情况下都适用,但是在许多应用程序中,64位精度过高,并且节省2倍的内存值得。提出一个显而易见的观点:

>>> df = pd.DataFrame(np.random.randn(int(1e8), 5))
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 5 columns):
...
dtypes: float64(5)
memory usage: 3.7 GB

>>> df.astype(np.float32).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 5 columns):
...
dtypes: float32(5)
memory usage: 1.9 GB

One trick I found helpful for large data use cases is to reduce the volume of the data by reducing float precision to 32-bit. It’s not applicable in all cases, but in many applications 64-bit precision is overkill and the 2x memory savings are worth it. To make an obvious point even more obvious:

>>> df = pd.DataFrame(np.random.randn(int(1e8), 5))
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 5 columns):
...
dtypes: float64(5)
memory usage: 3.7 GB

>>> df.astype(np.float32).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000000 entries, 0 to 99999999
Data columns (total 5 columns):
...
dtypes: float32(5)
memory usage: 1.9 GB
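
A related option, if you are sure the reduced precision is acceptable, is to downcast column by column with pd.to_numeric, which converts float columns to the smallest float dtype (float32):

# downcast every float64 column in place
for col in df.select_dtypes(include="float64").columns:
    df[col] = pd.to_numeric(df[col], downcast="float")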

回答 8

正如其他人所指出的,若干年后,pandas 的“核心外”(out-of-core)等价物已经出现:dask。尽管 dask 并不是 pandas 及其全部功能的直接替代品,但它凭借以下几个原因而脱颖而出:

Dask 是一个灵活的并行计算库,面向分析型计算,针对动态任务调度进行了优化,服务于“大数据”集合(如并行数组、数据框和列表)的交互式计算工作负载;这些集合把 NumPy、Pandas 或 Python 迭代器等常见接口扩展到大于内存或分布式的环境,并能从笔记本电脑扩展到集群。

Dask 强调以下优点:

  • 熟悉:提供并行化的 NumPy 数组和 Pandas DataFrame 对象
  • 灵活:提供任务调度接口,支持更多自定义工作负载,并可与其他项目集成。
  • 原生:纯 Python 实现分布式计算,可直接使用 PyData 技术栈。
  • 快速:开销低、延迟低,序列化开销降到快速数值算法所需的最低限度
  • 向上扩展:可在拥有上千核心的集群上弹性运行
  • 向下收缩:在笔记本电脑上以单进程即可轻松安装和运行
  • 响应式:面向交互式计算设计,提供快速的反馈和诊断信息来辅助使用者

并添加一个简单的代码示例:

import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()

替换这样的一些熊猫代码:

import pandas as pd
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()

尤其值得注意的是,它通过 concurrent.futures 接口提供了用于提交自定义任务的通用基础设施:

from dask.distributed import Client
client = Client('scheduler:port')

futures = []
for fn in filenames:
    future = client.submit(load, fn)
    futures.append(future)

summary = client.submit(summarize, futures)
summary.result()

As noted by others, after some years an ‘out-of-core’ pandas equivalent has emerged: dask. Though dask is not a drop-in replacement of pandas and all of its functionality it stands out for several reasons:

Dask is a flexible parallel computing library for analytic computing that is optimized for dynamic task scheduling for interactive computational workloads of “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments and scales from laptops to clusters.

Dask emphasizes the following virtues:

  • Familiar: Provides parallelized NumPy array and Pandas DataFrame objects
  • Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
  • Native: Enables distributed computing in Pure Python with access to the PyData stack.
  • Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
  • Scales up: Runs resiliently on clusters with 1000s of cores
  • Scales down: Trivial to set up and run on a laptop in a single process
  • Responsive: Designed with interactive computing in mind it provides rapid feedback and diagnostics to aid humans

and to add a simple code sample:

import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()

replaces some pandas code like this:

import pandas as pd
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()

and, especially noteworthy, provides through the concurrent.futures interface a general infrastructure for the submission of custom tasks:

from dask.distributed import Client
client = Client('scheduler:port')

futures = []
for fn in filenames:
    future = client.submit(load, fn)
    futures.append(future)

summary = client.submit(summarize, futures)
summary.result()

回答 9

在这里还值得一提的是Ray
它是一个分布式计算框架,自带了 pandas 的分布式实现。

只需替换pandas导入,代码应该可以正常运行:

# import pandas as pd
import ray.dataframe as pd

#use pd as usual

可以在这里阅读更多详细信息:

https://rise.cs.berkeley.edu/blog/pandas-on-ray/

It is worth mentioning here Ray as well,
it’s a distributed computation framework that has its own distributed implementation of pandas.

Just replace the pandas import, and the code should work as is:

# import pandas as pd
import ray.dataframe as pd

#use pd as usual

can read more details here:

https://rise.cs.berkeley.edu/blog/pandas-on-ray/


回答 10

另一种变化

在熊猫中完成的许多操作也可以作为db查询来完成(sql,mongo)

使用RDBMS或mongodb,您可以在数据库查询中执行某些聚合(针对大型数据进行了优化,并有效地使用了缓存和索引)

以后,您可以使用熊猫进行后期处理。

这种方法的优点是:在处理大型数据时可以获得数据库层面的优化,同时仍然用高级的声明式语法来定义逻辑,而无需操心“哪些放在内存里做、哪些放在核心外做”这类决策细节。

尽管查询语言和熊猫语言不同,但是将部分逻辑从一个逻辑转换到另一个逻辑通常并不复杂。

One more variation

Many of the operations done in pandas can also be done as a db query (sql, mongo)

Using a RDBMS or mongodb allows you to perform some of the aggregations in the DB Query (which is optimized for large data, and uses cache and indexes efficiently)

Later, you can perform post processing using pandas.

The advantage of this method is that you gain the DB optimizations for working with large data, while still defining the logic in a high level declarative syntax – and not having to deal with the details of deciding what to do in memory and what to do out of core.

And although the query language and pandas are different, it’s usually not complicated to translate part of the logic from one to another.
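
As a rough sketch of that division of labour (the SQLite file, table name and columns are hypothetical), the heavy aggregation runs inside the database and only the small result is pulled into pandas:

import sqlite3

import pandas as pd

con = sqlite3.connect("trades.db")
query = """
    SELECT user_id,
           COUNT(*)    AS n_trades,
           SUM(amount) AS total_amount
    FROM trades
    WHERE trade_date >= '2015-01-01'
    GROUP BY user_id
"""
# only the aggregated rows ever reach memory
summary = pd.read_sql_query(query, con)
con.close()

# post-processing in pandas on the (much smaller) result
top20 = summary.sort_values("total_amount", ascending=False).head(20)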


回答 11

如果您走简单路线,把数据管道拆分成多个较小的文件来构建,可以考虑使用 Ruffus。

Consider Ruffus if you go the simple path of creating a data pipeline which is broken down into multiple smaller files.
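
A tentative sketch of such a pipeline (the file names and the per-file computation are made up; it assumes Ruffus’ transform/suffix decorators and pipeline_run as described in its docs):

from ruffus import transform, suffix, pipeline_run
import pandas as pd

starting_files = ["trades_day01.csv", "trades_day02.csv"]

@transform(starting_files, suffix(".csv"), ".summary.csv")
def summarize_day(input_file, output_file):
    # each task reads one small file and writes one small result
    df = pd.read_csv(input_file)
    df.groupby("symbol")["volume"].sum().to_csv(output_file)

if __name__ == "__main__":
    # run the tasks, several files in parallel
    pipeline_run([summarize_day], multiprocess=4)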


回答 12

我最近遇到了类似的问题。我发现按块读取数据,再把这些块逐块追加写入同一个 csv,效果很好。我的问题是要根据另一张表中的信息、利用某些列的值来添加一个日期列。这也许能帮到那些被 dask 和 hdf5 弄糊涂、但和我一样更熟悉 pandas 的人。

def addDateColumn():
    """Adds time to the daily rainfall data. Reads the csv as chunks of 100k
    rows at a time and outputs them, appending as needed, to a single csv.
    Uses the column of the raster names to get the date.
    """
    df = pd.read_csv(pathlist[1] + "CHIRPS_tanz.csv", iterator=True,
                     chunksize=100000)  # read csv file as 100k chunks

    '''Do some stuff'''

    count = 1  # for indexing item in time list
    for chunk in df:  # for each 100k rows
        newtime = []  # empty list to append repeating times for different rows
        toiterate = chunk[chunk.columns[2]]  # ID of raster nums to base time
        while count <= toiterate.max():
            for i in toiterate:
                if i == count:
                    newtime.append(newyears[count])
            count += 1
        print("Finished a chunk; time index is now", count)
        chunk["time"] = newtime  # create new column in dataframe based on time
        outname = "CHIRPS_tanz_time2.csv"
        # append each output to the same csv, using no header
        chunk.to_csv(pathlist[2] + outname, mode='a', header=False, index=False)

I recently came across a similar issue. I found simply reading the data in chunks and appending it as I write it in chunks to the same csv works well. My problem was adding a date column based on information in another table, using the value of certain columns as follows. This may help those confused by dask and hdf5 but more familiar with pandas like myself.

def addDateColumn():
    """Adds time to the daily rainfall data. Reads the csv as chunks of 100k
    rows at a time and outputs them, appending as needed, to a single csv.
    Uses the column of the raster names to get the date.
    """
    df = pd.read_csv(pathlist[1] + "CHIRPS_tanz.csv", iterator=True,
                     chunksize=100000)  # read csv file as 100k chunks

    '''Do some stuff'''

    count = 1  # for indexing item in time list
    for chunk in df:  # for each 100k rows
        newtime = []  # empty list to append repeating times for different rows
        toiterate = chunk[chunk.columns[2]]  # ID of raster nums to base time
        while count <= toiterate.max():
            for i in toiterate:
                if i == count:
                    newtime.append(newyears[count])
            count += 1
        print("Finished a chunk; time index is now", count)
        chunk["time"] = newtime  # create new column in dataframe based on time
        outname = "CHIRPS_tanz_time2.csv"
        # append each output to the same csv, using no header
        chunk.to_csv(pathlist[2] + outname, mode='a', header=False, index=False)

回答 13

我想指出一下Vaex软件包。

Vaex 是一个用于惰性、核心外 DataFrame(类似于 Pandas)的 python 库,用来可视化和探索大型表格数据集。它能以高达每秒十亿(10⁹)个对象/行的速度,在 N 维网格上计算平均值、总和、计数、标准差等统计量。可视化通过直方图、密度图和 3d 体渲染完成,支持对大数据进行交互式探索。Vaex 使用内存映射、零内存拷贝策略和惰性计算来获得最佳性能(不浪费内存)。

看一下文档:https://vaex.readthedocs.io/en/latest/ 该API非常接近于熊猫API。

I’d like to point out the Vaex package.

Vaex is a python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid up to a billion (10⁹) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted).

Have a look at the documentation: https://vaex.readthedocs.io/en/latest/ The API is very close to the API of pandas.
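
A small sketch of what that looks like in practice (the file name and the column names x/y are hypothetical; vaex.open memory-maps the file rather than loading it):

import vaex

# memory-map an HDF5/Arrow file instead of reading it into RAM
df = vaex.open("big_table.hdf5")

# expressions are lazy; statistics are computed out-of-core
print(df.count(), df.mean(df.x))

# filtering creates a lightweight view, not a copy of the data
positive = df[df.x > 0]
print(positive.mean(positive.y))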


回答 14

为什么选择熊猫?您是否尝试过标准Python

使用 Python 标准库。即使最近发布了稳定版本,Pandas 仍然更新频繁。

使用标准的python库,您的代码将始终运行。

一种实现方法是对要存储数据的方式有所了解,并对数据要解决哪些问题。然后绘制一个模式,说明如何组织数据(思考表),这将有助于您查询数据,而不必进行规范化。

您可以充分利用:

  • 字典列表,用于将数据存储在内存中,一个字典为一行,
  • 生成器逐行处理数据,以免RAM溢出,
  • 列表推导式,用于查询您的数据,
  • 利用Counter,DefaultDict,…
  • 使用您选择的任何存储解决方案将数据存储在硬盘上,json可能是其中之一。

随着时间的推移,Ram和HDD越来越便宜,并且标准python 3广泛可用且稳定。

Why Pandas ? Have you tried Standard Python ?

Use the standard Python library. Pandas is subject to frequent updates, even with the recent release of the stable version.

Using the standard python library your code will always run.

One way of doing it is to have an idea of the way you want your data to be stored , and which questions you want to solve regarding the data. Then draw a schema of how you can organise your data (think tables) that will help you query the data, not necessarily normalisation.

You can make good use of :

  • list of dictionaries to store the data in memory, one dict being one row,
  • generators to process the data row after row to not overflow your RAM,
  • list comprehension to query your data,
  • make use of Counter, DefaultDict, …
  • store your data on your hard drive using whatever storing solution you have chosen, json could be one of them.

RAM and HDD are becoming cheaper and cheaper with time, and standard Python 3 is widely available and stable.
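
A minimal sketch of that style using only the standard library (the CSV file and its 'category'/'amount' fields are hypothetical):

import csv
import json
from collections import Counter, defaultdict

def rows(path):
    # generator: only one row is held in memory at a time
    with open(path, newline="") as fh:
        yield from csv.DictReader(fh)

counts = Counter()
totals = defaultdict(float)
for row in rows("transactions.csv"):
    counts[row["category"]] += 1
    totals[row["category"]] += float(row["amount"])

# persist the aggregated result with whatever storage you chose, e.g. json
with open("summary.json", "w") as fh:
    json.dump({"counts": counts, "totals": totals}, fh)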


回答 15

目前,我做的事情和您“类似”,只是规模小一些,所以我还没有可供建议用的 PoC。

不过,我把 pickle 用作缓存系统,并把各种功能的执行外包到单独的文件中,再从我的主控 / main 文件去执行这些文件,这种做法似乎行得通。例如,我用 prepare_use.py 来转换对象类型,并把数据集拆分成测试、验证和预测数据集。

用 pickle 做缓存是如何工作的?我用字符串来访问动态创建的 pickle 文件,文件名取决于传入的参数和数据集(我借此尝试捕获并判断程序是否已经运行过:数据集用 .shape 表示,传入的参数用 dict 表示)。基于这些信息,我得到一个字符串,用它去查找并读取对应的 .pickle 文件;如果找到了,就可以跳过处理时间,直接跳到我当前正在进行的执行步骤。

使用数据库时我也遇到过类似的问题,这也是我乐于采用这个方案的原因。当然它肯定有不少限制,例如由于冗余而要存储大量的 pickle 文件。把表从转换前更新到转换后,可以靠正确的索引来完成;而校验信息则完全是另一本书了(我曾尝试合并爬取来的租金数据,基本上 2 小时后就放弃了数据库,因为我希望能在每个转换步骤之后都能跳回去)。

我希望我的2美分能以某种方式对您有所帮助。

问候。

At the moment I am working “like” you, just on a lower scale, which is why I don’t have a PoC for my suggestion.

However, I seem to find success in using pickle as a caching system and outsourcing execution of various functions into files – executing these files from my commando / main file; for example, I use a prepare_use.py to convert object types, split a data set into test, validating and prediction data sets.

How does your caching with pickle work? I use strings in order to access pickle-files that are dynamically created, depending on which parameters and data sets were passed (with that I try to capture and determine if the program was already run, using .shape for the data set, dict for passed parameters). Respecting these measures, I get a String to try to find and read a .pickle-file and can, if found, skip processing time in order to jump to the execution I am working on right now.

Using databases I encountered similar problems, which is why I found joy in using this solution; however – there are many constraints for sure – for example storing huge pickle sets due to redundancy. Updating a table from before to after a transformation can be done with proper indexing – validating information opens up a whole other book (I tried consolidating crawled rent data and stopped using a database after 2 hours basically – as I would have liked to jump back after every transformation process).
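
A rough sketch of the caching idea described above (keying the pickle file on the data set’s .shape and the passed parameters; the decorator and the prepare_use step are illustrative, not the author’s actual code):

import hashlib
import os
import pickle

def pickle_cached(func):
    """Cache a function's result in a pickle file keyed by data shape and params."""
    def wrapper(df, **params):
        key = hashlib.md5(repr((df.shape, sorted(params.items()))).encode()).hexdigest()
        path = f"cache_{func.__name__}_{key}.pickle"
        if os.path.exists(path):
            with open(path, "rb") as fh:
                return pickle.load(fh)      # cache hit: skip the expensive step
        result = func(df, **params)
        with open(path, "wb") as fh:
            pickle.dump(result, fh)
        return result
    return wrapper

@pickle_cached
def prepare_use(df, test_size=0.2):
    # stand-in for the expensive conversion / splitting step
    cut = int(len(df) * (1 - test_size))
    return df.iloc[:cut], df.iloc[cut:]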

I hope my 2 cents help you in some way.

Greetings.