Tag Archives: pandas

Get the correlation between two columns using .corr

Question: Get the correlation between two columns using .corr

I have the following pandas dataframe Top15:

I create a column that estimates the number of citable documents per person:

Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
Top15['Citable docs per Capita'] = Top15['Citable documents'] / Top15['PopEst']

I want to know the correlation between the number of citable documents per capita and the energy supply per capita. So I use the .corr() method (Pearson’s correlation):

data = Top15[['Citable docs per Capita','Energy Supply per Capita']]
correlation = data.corr(method='pearson')

I want to return a single number, but the result is:


Answer 0

Without actual data it is hard to answer the question but I guess you are looking for something like this:

Top15['Citable docs per Capita'].corr(Top15['Energy Supply per Capita'])

That calculates the correlation between your two columns 'Citable docs per Capita' and 'Energy Supply per Capita'.

To give an example:

import pandas as pd

df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)]})

   A  B
0  0  0
1  1  2
2  2  4
3  3  6

Then

df['A'].corr(df['B'])

gives 1 as expected.

Now, if you change a value, e.g.

df.loc[2, 'B'] = 4.5

   A    B
0  0  0.0
1  1  2.0
2  2  4.5
3  3  6.0

the command

df['A'].corr(df['B'])

returns

0.99586

which is still close to 1, as expected.

If you apply .corr directly to your dataframe, it will return all pairwise correlations between your columns; that’s why you then observe 1s at the diagonal of your matrix (each column is perfectly correlated with itself).

df.corr()

will therefore return

          A         B
A  1.000000  0.995862
B  0.995862  1.000000

In the graphic you show, only the upper left corner of the correlation matrix is represented (I assume).

There can be cases where you get NaNs in your solution – check this post for an example.

If you want to filter entries above/below a certain threshold, you can check this question. If you want to plot a heatmap of the correlation coefficients, you can check this answer, and if you then run into the issue of overlapping axis labels, check the following post.
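
For the heatmap case, here is a minimal matplotlib sketch (my own illustration, not from the linked answer; seaborn.heatmap is a common alternative):

import matplotlib.pyplot as plt

corr = df.corr()                      # assumes a numeric DataFrame `df`
plt.imshow(corr, cmap='viridis')
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.show()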


Answer 1

I ran into the same issue. It appeared Citable Documents per Person was a float, and Python skips it somehow by default. All the other columns of my dataframe were in numpy formats, so I solved it by converting the column to np.float64:

Top15['Citable Documents per Person']=np.float64(Top15['Citable Documents per Person'])

Remember, this is exactly the column you calculated yourself.


Answer 2

My solution, after converting the data to a numeric type, is:

Top15[['Citable docs per Capita','Energy Supply per Capita']].corr()

Answer 3

If you want the correlations between all pairs of columns, you could do something like this:

import pandas as pd
import numpy as np

def get_corrs(df):
    col_correlations = df.corr()
    col_correlations.loc[:, :] = np.tril(col_correlations, k=-1)
    cor_pairs = col_correlations.stack()
    return cor_pairs.to_dict()

my_corrs = get_corrs(df)
# and the following line to retrieve the single correlation
print(my_corrs[('Citable docs per Capita','Energy Supply per Capita')])

Answer 4

When you call this:

data = Top15[['Citable docs per Capita','Energy Supply per Capita']]
correlation = data.corr(method='pearson')

Since the DataFrame.corr() function performs pair-wise correlations, you have four pairs from the two variables. So, basically, you get the diagonal values as auto-correlations (correlation with itself; two values, since you have two variables), and the other two values as the cross-correlation of one versus the other and vice versa.

Either perform the correlation between the two series to get a single value:

from scipy.stats.stats import pearsonr
docs_col = Top15['Citable docs per Capita'].values
energy_col = Top15['Energy Supply per Capita'].values
corr , _ = pearsonr(docs_col, energy_col)

or, if you want a single value from the same function (the DataFrame's corr):

single_value = correlation.iloc[0, 1]

Hope this helps.


Answer 5

It works like this:

Top15['Citable docs per Capita']=np.float64(Top15['Citable docs per Capita'])

Top15['Energy Supply per Capita']=np.float64(Top15['Energy Supply per Capita'])

Top15['Energy Supply per Capita'].corr(Top15['Citable docs per Capita'])

Answer 6

I solved this problem by changing the data type. If you look, 'Energy Supply per Capita' is a numerical type, while 'Citable docs per Capita' is an object type. I converted the column to float using astype. I had the same problem with some np functions: count_nonzero and sum worked, while mean and std didn't.
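
A minimal sketch of that astype fix, assuming the column names from the question:

Top15['Citable docs per Capita'] = Top15['Citable docs per Capita'].astype(float)
Top15['Citable docs per Capita'].corr(Top15['Energy Supply per Capita'])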


Answer 7

Changing 'Citable docs per Capita' to numeric before computing the correlation will solve the problem.

    Top15['Citable docs per Capita'] = pd.to_numeric(Top15['Citable docs per Capita'])
    data = Top15[['Citable docs per Capita','Energy Supply per Capita']]
    correlation = data.corr(method='pearson')

Multiple aggregations of the same column using pandas GroupBy.agg()

Question: Multiple aggregations of the same column using pandas GroupBy.agg()

Is there a pandas built-in way to apply two different aggregating functions f1, f2 to the same column df["returns"], without having to call agg() multiple times?

Example dataframe:

import pandas as pd
import numpy as np
import datetime as dt

np.random.seed(0)
df = pd.DataFrame({
         "date"    :  [dt.date(2012, x, 1) for x in range(1, 11)], 
         "returns" :  0.05 * np.random.randn(10), 
         "dummy"   :  np.repeat(1, 10)
})

The syntactically wrong, but intuitively right, way to do it would be:

# Assume `f1` and `f2` are defined for aggregating.
df.groupby("dummy").agg({"returns": f1, "returns": f2})

Obviously, Python doesn't allow duplicate keys. Is there any other way of expressing the input to agg()? Perhaps a list of tuples [(column, function)] would work better, to allow multiple functions applied to the same column? But it seems like agg() only accepts a dictionary.

Is there a workaround for this besides defining an auxiliary function that just applies both of the functions inside of it? (How would this work with aggregation anyway?)


Answer 0

You can simply pass the functions as a list:

In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:         
           mean       sum
dummy                    
1      0.036901  0.369012

or as a dictionary:

In [21]: df.groupby('dummy').agg({'returns':
                                  {'Mean': np.mean, 'Sum': np.sum}})
Out[21]: 
        returns          
           Mean       Sum
dummy                    
1      0.036901  0.369012

Answer 1

TLDR; Pandas groupby.agg has a new, easier syntax for specifying (1) aggregations on multiple columns, and (2) multiple aggregations on a column. So, to do this for pandas >= 0.25, use

df.groupby('dummy').agg(Mean=('returns', 'mean'), Sum=('returns', 'sum'))

           Mean       Sum
dummy                    
1      0.036901  0.369012

OR

df.groupby('dummy')['returns'].agg(Mean='mean', Sum='sum')

           Mean       Sum
dummy                    
1      0.036901  0.369012

Pandas >= 0.25: Named Aggregation

Pandas has changed the behavior of GroupBy.agg in favour of a more intuitive syntax for specifying named aggregations. See the 0.25 docs section on Enhancements as well as relevant GitHub issues GH18366 and GH26512.

From the documentation,

To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where

  • The keywords are the output column names
  • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields [‘column’, ‘aggfunc’] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

You can now pass a tuple via keyword arguments. The tuples follow the format of (<colName>, <aggFunc>).

import pandas as pd

pd.__version__                                                                                                                            
# '0.25.0.dev0+840.g989f912ee'

# Setup
df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                   'height': [9.1, 6.0, 9.5, 34.0],
                   'weight': [7.9, 7.5, 9.9, 198.0]
})

df.groupby('kind').agg(
    max_height=('height', 'max'), min_weight=('weight', 'min'),)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

Alternatively, you can use pd.NamedAgg (essentially a namedtuple) which makes things more explicit.

df.groupby('kind').agg(
    max_height=pd.NamedAgg(column='height', aggfunc='max'), 
    min_weight=pd.NamedAgg(column='weight', aggfunc='min')
)

      max_height  min_weight
kind                        
cat          9.5         7.9
dog         34.0         7.5

It is even simpler for Series, just pass the aggfunc to a keyword argument.

df.groupby('kind')['height'].agg(max_height='max', min_height='min')    

      max_height  min_height
kind                        
cat          9.5         9.1
dog         34.0         6.0       

Lastly, if your column names aren’t valid python identifiers, use a dictionary with unpacking:

df.groupby('kind')['height'].agg(**{'max height': 'max', ...})

Pandas < 0.25

In more recent versions of pandas leading up to 0.24, if using a dictionary for specifying column names for the aggregation output, you will get a FutureWarning:

df.groupby('dummy').agg({'returns': {'Mean': 'mean', 'Sum': 'sum'}})
# FutureWarning: using a dict with renaming is deprecated and will be removed 
# in a future version

Using a dictionary for renaming columns is deprecated in v0.20. On more recent versions of pandas, this can be specified more simply by passing a list of tuples. If specifying the functions this way, all functions for that column need to be specified as tuples of (name, function) pairs.

df.groupby("dummy").agg({'returns': [('op1', 'sum'), ('op2', 'mean')]})

        returns          
            op1       op2
dummy                    
1      0.328953  0.032895

Or,

df.groupby("dummy")['returns'].agg([('op1', 'sum'), ('op2', 'mean')])

            op1       op2
dummy                    
1      0.328953  0.032895

Answer 2

Would something like this work:

In [7]: df.groupby('dummy').returns.agg({'func1' : lambda x: x.sum(), 'func2' : lambda x: x.prod()})
Out[7]: 
              func2     func1
dummy                        
1     -4.263768e-16 -0.188565

Add missing dates to a pandas dataframe

Question: Add missing dates to a pandas dataframe

My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don’t always match.

idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size()

In the above code, idx becomes a range of, say, 30 dates, from 09-01-2013 to 09-30-2013. However, s may only have 25 or 26 entries, because no events happened on some of those dates. I then get an AssertionError when I try to plot, as the sizes don't match:

fig, ax = plt.subplots()    
ax.bar(idx.to_pydatetime(), s, color='green')

What's the proper way to tackle this? Should I remove dates with no values from idx, or (which I'd rather do) add the missing dates to the series with a count of 0? I'd rather have a full graph of 30 days with 0 values. If this approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?

Here's a snippet of s (df.groupby(['simpleDate']).size()); notice there are no entries for 09-04 and 09-05.

09-02-2013     2
09-03-2013    10
09-06-2013     5
09-07-2013     1

Answer 0

You could use Series.reindex:

import pandas as pd

idx = pd.date_range('09-01-2013', '09-30-2013')

s = pd.Series({'09-02-2013': 2,
               '09-03-2013': 10,
               '09-06-2013': 5,
               '09-07-2013': 1})
s.index = pd.DatetimeIndex(s.index)

s = s.reindex(idx, fill_value=0)
print(s)

yields

2013-09-01     0
2013-09-02     2
2013-09-03    10
2013-09-04     0
2013-09-05     0
2013-09-06     5
2013-09-07     1
2013-09-08     0
...

Answer 1

A quicker workaround is to use .asfreq(). This doesn't require creating a new index to pass to .reindex().

# "broken" (staggered) dates
dates = pd.Index([pd.Timestamp('2012-05-01'), 
                  pd.Timestamp('2012-05-04'), 
                  pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)

print(s.asfreq('D'))
2012-05-01    1.0
2012-05-02    NaN
2012-05-03    NaN
2012-05-04    2.0
2012-05-05    NaN
2012-05-06    3.0
Freq: D, dtype: float64
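
Since the question asks for zeroes rather than NaN, a small follow-up sketch (assuming the same s as above): .asfreq() accepts a fill_value, or you can chain .fillna(0):

print(s.asfreq('D', fill_value=0))   # newly inserted dates get 0 instead of NaN
# equivalently: s.asfreq('D').fillna(0)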

Answer 2

One issue is that reindex will fail if there are duplicate values. Say we’re working with timestamped data, which we want to index by date:

df = pd.DataFrame({
    'timestamps': pd.to_datetime(
        ['2016-11-15 1:00','2016-11-16 2:00','2016-11-16 3:00','2016-11-18 4:00']),
    'values':['a','b','c','d']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
df

yields

            timestamps             values
2016-11-15  "2016-11-15 01:00:00"  a
2016-11-16  "2016-11-16 02:00:00"  b
2016-11-16  "2016-11-16 03:00:00"  c
2016-11-18  "2016-11-18 04:00:00"  d

Due to the duplicate 2016-11-16 date, an attempt to reindex:

all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
df.reindex(all_days)

fails with:

...
ValueError: cannot reindex from a duplicate axis

(by this it means the index has duplicates, not that it is itself a dup)

Instead, we can use .loc to look up entries for all dates in range:

df.loc[all_days]

yields

            timestamps             values
2016-11-15  "2016-11-15 01:00:00"  a
2016-11-16  "2016-11-16 02:00:00"  b
2016-11-16  "2016-11-16 03:00:00"  c
2016-11-17  NaN                    NaN
2016-11-18  "2016-11-18 04:00:00"  d

fillna can be used on the column series to fill blanks if needed.


Answer 3

An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:

df.resample('D').mean()

resample is a deferred operation like groupby so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.

Here is the original data, but with an extra entry for ‘2013-09-03’:

             val
date           
2013-09-02     2
2013-09-03    10
2013-09-03    20    <- duplicate date added to OP's data
2013-09-06     5
2013-09-07     1

And here are the results:

             val
date            
2013-09-02   2.0
2013-09-03  15.0    <- mean of original values for 2013-09-03
2013-09-04   NaN    <- NaN b/c date not present in orig
2013-09-05   NaN    <- NaN b/c date not present in orig
2013-09-06   5.0
2013-09-07   1.0

I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace NaNs with zeroes as requested by the OP or alternatively use something like interpolate() to fill with non-zero values based on the neighboring rows.
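
For completeness, a one-line sketch of those variants (assuming the same df as above):

df.resample('D').mean().fillna(0)        # zeroes instead of NaN
df.resample('D').mean().interpolate()    # or fill from neighboring rows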


Answer 4

Here's a method to fill missing dates into a dataframe, with your choice of fill_value, how many days_back to fill in, and the sort order (date_order) by which to sort the dataframe:

import pandas as pd
from datetime import datetime, timedelta

def fill_in_missing_dates(df, date_col_name='date', date_order='asc', fill_value=0, days_back=30):
    # index the frame by its date column
    df.set_index(date_col_name, drop=True, inplace=True)
    df.index = pd.DatetimeIndex(df.index)
    # build a full daily range covering the last `days_back` days
    d = datetime.now().date()
    d2 = d - timedelta(days=days_back)
    idx = pd.date_range(d2, d, freq="D")
    # reindex to the full range, filling the missing dates, then sort and restore the date column
    df = df.reindex(idx, fill_value=fill_value)
    df = df.sort_index(ascending=(date_order == 'asc'))
    df[date_col_name] = pd.DatetimeIndex(df.index)

    return df

How to keep the index when using pandas merge

Question: How to keep the index when using pandas merge

I would like to merge two DataFrames, and keep the index from the first frame as the index on the merged dataset. However, when I do the merge, the resulting DataFrame has an integer index. How can I specify that I want to keep the index from the left data frame?

In [4]: a = pd.DataFrame({'col1': {'a': 1, 'b': 2, 'c': 3}, 
                          'to_merge_on': {'a': 1, 'b': 3, 'c': 4}})

In [5]: b = pd.DataFrame({'col2': {0: 1, 1: 2, 2: 3}, 
                          'to_merge_on': {0: 1, 1: 3, 2: 5}})

In [6]: a
Out[6]:
   col1  to_merge_on
a     1            1
b     2            3
c     3            4

In [7]: b
Out[7]:
   col2  to_merge_on
0     1            1
1     2            3
2     3            5

In [8]: a.merge(b, how='left')
Out[8]:
   col1  to_merge_on  col2
0     1            1   1.0
1     2            3   2.0
2     3            4   NaN

In [9]: _.index
Out[9]: Int64Index([0, 1, 2], dtype='int64')

EDIT: Switched to example code that can be easily reproduced


Answer 0

In [5]: a.reset_index().merge(b, how="left").set_index('index')
Out[5]:
       col1  to_merge_on  col2
index
a         1            1     1
b         2            3     2
c         3            4   NaN

Note that for some left merge operations, you may end up with more rows than in a when there are multiple matches between a and b. In this case, you may need to drop duplicates.
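
As a hypothetical follow-up (not part of the original answer), if the merge does multiply rows, you might de-duplicate on the saved index column before restoring it:

merged = a.reset_index().merge(b, how="left")
merged = merged.drop_duplicates(subset='index').set_index('index')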


Answer 1

You can make a copy of the index on the left dataframe and do the merge.

a['copy_index'] = a.index
a.merge(b, how='left')

I found this simple method very useful while working with large dataframes and using pd.merge_asof() (or dd.merge_asof()).

This approach is better when resetting the index is expensive (a large dataframe).
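
To actually restore the index afterwards, one might set it back from the copied column (my own sketch, not from the original answer):

merged = a.merge(b, how='left')
merged = merged.set_index('copy_index').rename_axis(None)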


Answer 2

There is a non-pd.merge solution using Series.map and DataFrame.set_index.

In : a['col2'] = a['to_merge_on'].map(b.set_index('to_merge_on')['col2'])
In : a
Out:
   col1  to_merge_on  col2
a     1            1   1.0
b     2            3   2.0
c     3            4   NaN

This doesn't introduce a dummy index name for the index.

Note, however, that there is no DataFrame.map method, and so this approach is not for multiple columns.


Answer 3

df1 = df1.merge(df2, how="inner", left_index=True, right_index=True)

This allows to preserve the index of df1


Answer 4

I think I've come up with a different solution. I was joining the left table on its index value and the right table on a column value based on the left table's index. What I did was a normal merge:

First10ReviewsJoined = pd.merge(First10Reviews, df, left_index=True, right_on='Line Number')

Then I retrieved the new index numbers from the merged table and put them in a new column named Sentiment Line Number:

First10ReviewsJoined['Sentiment Line Number']= First10ReviewsJoined.index.tolist()

Then I manually set the index back to the original left-table index, based on the pre-existing column called Line Number (the column value I joined on from the left table's index):

First10ReviewsJoined.set_index('Line Number', inplace=True)

Then I removed the Line Number index name so that it remains blank:

First10ReviewsJoined.index.name = None

Maybe a bit of a hack, but it seems to work well and is relatively simple. Also, I guess it reduces the risk of duplicates/messing up your data. Hopefully that all makes sense.


Answer 5

Another simple option is to restore the index to what it was before:

a.merge(b, how="left").set_axis(a.index)

merge preserves the row order of dataframe 'a', but resets the index, so it is safe to use set_axis.


datetime dtypes in pandas read_csv

Question: datetime dtypes in pandas read_csv

I’m reading in a csv file with multiple datetime columns. I’d need to set the data types upon reading in the file, but datetimes appear to be a problem. For instance:

headers = ['col1', 'col2', 'col3', 'col4']
dtypes = ['datetime', 'datetime', 'str', 'float']
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

When run, this gives an error:

TypeError: data type “datetime” not understood

Converting the columns after the fact via pandas.to_datetime() isn't an option: I can't know which columns will be datetime objects. That information can change and comes from whatever informs my dtypes list.

Alternatively, I've tried to load the csv file with numpy.genfromtxt, set the dtypes in that function, and then convert to a pandas DataFrame, but it garbles the data. Any help is greatly appreciated!


Answer 0

Why it does not work

There is no datetime dtype to be set for read_csv as csv files can only contain strings, integers and floats.

Setting a dtype to datetime will make pandas interpret the datetime as an object, meaning you will end up with a string.

Pandas way of solving this

The pandas.read_csv() function has a keyword argument called parse_dates

Using this, you can convert strings, floats, or integers into datetimes on the fly, using the default date_parser (dateutil.parser.parser).

headers = ['col1', 'col2', 'col3', 'col4']
dtypes = {'col1': 'str', 'col2': 'str', 'col3': 'str', 'col4': 'float'}
parse_dates = ['col1', 'col2']
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes, parse_dates=parse_dates)

This will cause pandas to read col1 and col2 as strings, which they most likely are (“2016-05-05” etc.) and after having read the string, the date_parser for each column will act upon that string and give back whatever that function returns.

Defining your own date parsing function:

The pandas.read_csv() function also has a keyword argument called date_parser

Setting this to a lambda function will make that particular function be used for the parsing of the dates.

GOTCHA WARNING

You have to give it the function itself, not the result of calling the function; thus this is correct:

date_parser = pd.datetools.to_datetime

This is incorrect:

date_parser = pd.datetools.to_datetime()

Pandas 0.22 Update

pd.datetools.to_datetime has been relocated, so use date_parser = pd.to_datetime.

Thanks @stackoverYC


Answer 1

There is a parse_dates parameter for read_csv which allows you to define the names of the columns you want treated as dates or datetimes:

date_cols = ['col1', 'col2']
pd.read_csv(file, sep='\t', header=None, names=headers, parse_dates=date_cols)

Answer 2

You might try passing actual types instead of strings.

import pandas as pd
from datetime import datetime
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime, datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

But it’s going to be really hard to diagnose this without any of your data to tinker with.

And really, you probably want pandas to parse the dates into Timestamps, so that might be:

pd.read_csv(file, sep='\t', header=None, names=headers, parse_dates=True)

Answer 3

I tried using the dtypes=[datetime, …] option, but

import pandas as pd
from datetime import datetime
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime, datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

I encountered the following error:

TypeError: data type not understood

The only change I had to make is to replace datetime with datetime.datetime

import pandas as pd
import datetime
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime.datetime, datetime.datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

How to form a tuple column from two columns in pandas

Question: How to form a tuple column from two columns in pandas

I’ve got a Pandas DataFrame and I want to combine the ‘lat’ and ‘long’ columns to form a tuple.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 205482 entries, 0 to 209018
Data columns:
Month           205482  non-null values
Reported by     205482  non-null values
Falls within    205482  non-null values
Easting         205482  non-null values
Northing        205482  non-null values
Location        205482  non-null values
Crime type      205482  non-null values
long            205482  non-null values
lat             205482  non-null values
dtypes: float64(4), object(5)

The code I tried to use was:

def merge_two_cols(series): 
    return (series['lat'], series['long'])

sample['lat_long'] = sample.apply(merge_two_cols, axis=1)

However, this returned the following error:

---------------------------------------------------------------------------
 AssertionError                            Traceback (most recent call last)
<ipython-input-261-e752e52a96e6> in <module>()
      2     return (series['lat'], series['long'])
      3 
----> 4 sample['lat_long'] = sample.apply(merge_two_cols, axis=1)
      5

AssertionError: Block shape incompatible with manager 

How can I solve this problem?


Answer 0

Get comfortable with zip. It comes in handy when dealing with column data.

df['new_col'] = list(zip(df.lat, df.long))

It’s less complicated and faster than using apply or map. Something like np.dstack is twice as fast as zip, but wouldn’t give you tuples.


Answer 1

In [10]: df
Out[10]:
          A         B       lat      long
0  1.428987  0.614405  0.484370 -0.628298
1 -0.485747  0.275096  0.497116  1.047605
2  0.822527  0.340689  2.120676 -2.436831
3  0.384719 -0.042070  1.426703 -0.634355
4 -0.937442  2.520756 -1.662615 -1.377490
5 -0.154816  0.617671 -0.090484 -0.191906
6 -0.705177 -1.086138 -0.629708  1.332853
7  0.637496 -0.643773 -0.492668 -0.777344
8  1.109497 -0.610165  0.260325  2.533383
9 -1.224584  0.117668  1.304369 -0.152561

In [11]: df['lat_long'] = df[['lat', 'long']].apply(tuple, axis=1)

In [12]: df
Out[12]:
          A         B       lat      long                             lat_long
0  1.428987  0.614405  0.484370 -0.628298      (0.484370195967, -0.6282975278)
1 -0.485747  0.275096  0.497116  1.047605      (0.497115615839, 1.04760475074)
2  0.822527  0.340689  2.120676 -2.436831      (2.12067574274, -2.43683074367)
3  0.384719 -0.042070  1.426703 -0.634355      (1.42670326172, -0.63435462504)
4 -0.937442  2.520756 -1.662615 -1.377490     (-1.66261469102, -1.37749004179)
5 -0.154816  0.617671 -0.090484 -0.191906  (-0.0904840623396, -0.191905582481)
6 -0.705177 -1.086138 -0.629708  1.332853     (-0.629707821728, 1.33285348929)
7  0.637496 -0.643773 -0.492668 -0.777344   (-0.492667604075, -0.777344111021)
8  1.109497 -0.610165  0.260325  2.533383        (0.26032456699, 2.5333825651)
9 -1.224584  0.117668  1.304369 -0.152561     (1.30436900612, -0.152560909725)

Answer 2

Pandas has the itertuples method to do exactly this:

list(df[['lat', 'long']].itertuples(index=False, name=None))
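
If the goal is a new column rather than a plain list, the same call can be assigned back (an assumption about intent, not part of the original answer):

df['lat_long'] = list(df[['lat', 'long']].itertuples(index=False, name=None))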

Answer 3

I'd like to add df.values.tolist() (as long as you don't mind getting a column of lists rather than tuples).

import pandas as pd
import numpy as np

size = int(1e+07)
df = pd.DataFrame({'a': np.random.rand(size), 'b': np.random.rand(size)}) 

%timeit df.values.tolist()
1.47 s ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit list(zip(df.a,df.b))
1.92 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

How to estimate how much memory a pandas DataFrame will need?

Question: How to estimate how much memory a pandas DataFrame will need?

I have been wondering… If I am reading, say, a 400MB csv file into a pandas dataframe (using read_csv or read_table), is there any way to guesstimate how much memory this will need? Just trying to get a better feel of data frames and memory…


Answer 0

df.memory_usage() will return how many bytes each column occupies:

>>> df.memory_usage()

Row_ID            20906600
Household_ID      20906600
Vehicle           20906600
Calendar_Year     20906600
Model_Year        20906600
...

To include indexes, pass index=True.

So to get overall memory consumption:

>>> df.memory_usage(index=True).sum()
731731000

Also, passing deep=True will enable a more accurate memory usage report, that accounts for the full usage of the contained objects.

This is because memory usage does not include memory consumed by elements that are not components of the array if deep=False (default case).
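
A small illustration of the difference, with made-up data (not from the original answer):

import pandas as pd

df = pd.DataFrame({'s': ['a reasonably long string'] * 1000})
print(df.memory_usage().sum())            # shallow: counts only the 8-byte object pointers
print(df.memory_usage(deep=True).sum())   # deep: also counts the string payloads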


Answer 1

Here’s a comparison of the different methods – sys.getsizeof(df) is simplest.

For this example, df is a dataframe with 814 rows, 11 columns (2 ints, 9 objects) – read from a 427kb shapefile

sys.getsizeof(df)

>>> import sys
>>> sys.getsizeof(df)
(gives results in bytes)
462456

df.memory_usage()

>>> df.memory_usage()
...
(lists each column at 8 bytes/row)

>>> df.memory_usage().sum()
71712
(roughly rows * cols * 8 bytes)

>>> df.memory_usage(deep=True)
(lists each column's full memory usage)

>>> df.memory_usage(deep=True).sum()
(gives results in bytes)
462432

df.info()

Prints dataframe info to stdout. Technically these are kibibytes (KiB), not kilobytes – as the docstring says, “Memory usage is shown in human-readable units (base-2 representation).” So to get bytes you would multiply by 1024, e.g. 451.6 KiB = 462,438 bytes.

>>> df.info()
...
memory usage: 70.0+ KB

>>> df.info(memory_usage='deep')
...
memory usage: 451.6 KB

Answer 2

I thought I would bring some more data to the discussion.

I ran a series of tests on this issue.

By using the python resource package I got the memory usage of my process.

And by writing the csv into a StringIO buffer, I could easily measure the size of it in bytes.

I ran two experiments, each one creating 20 dataframes of increasing sizes between 10,000 lines and 1,000,000 lines. Both having 10 columns.
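
A sketch of one such measurement under the assumptions above (the exact script is not part of the original answer; resource is Unix-only, and ru_maxrss is reported in kilobytes on Linux):

import resource
from io import StringIO

import numpy as np
import pandas as pd

n_rows = 100_000
df = pd.DataFrame(np.random.rand(n_rows, 10))

buf = StringIO()
df.to_csv(buf)
csv_bytes = len(buf.getvalue())                               # size of the data written as csv

mem_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss   # peak memory of this process

print(csv_bytes, mem_kb)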

In the first experiment I used only floats in my dataset.

This is how the memory increased in comparison to the csv file as a function of the number of lines. (Size in Megabytes)

In the second experiment I took the same approach, but the data in the dataset consisted only of short strings.

It seems that the relation of the size of the csv and the size of the dataframe can vary quite a lot, but the size in memory will always be bigger by a factor of 2-3 (for the frame sizes in this experiment)

I would love to complete this answer with more experiments, please comment if you want me to try something special.


Answer 3

You have to do this in reverse.

In [4]: DataFrame(randn(1000000,20)).to_csv('test.csv')

In [5]: !ls -ltr test.csv
-rw-rw-r-- 1 users 399508276 Aug  6 16:55 test.csv

Technically memory is about this (which includes the indexes)

In [16]: df.values.nbytes + df.index.nbytes + df.columns.nbytes
Out[16]: 168000160

So 168MB in memory with a 400MB file, 1M rows of 20 float columns

DataFrame(randn(1000000,20)).to_hdf('test.h5','df')

!ls -ltr test.h5
-rw-rw-r-- 1 users 168073944 Aug  6 16:57 test.h5

MUCH more compact when written as a binary HDF5 file

In [12]: DataFrame(randn(1000000,20)).to_hdf('test.h5','df',complevel=9,complib='blosc')

In [13]: !ls -ltr test.h5
-rw-rw-r-- 1 users 154727012 Aug  6 16:58 test.h5

The data was random, so compression doesn’t help too much


Answer 4

If you know the dtypes of your array then you can directly compute the number of bytes that it will take to store your data + some for the Python objects themselves. A useful attribute of numpy arrays is nbytes. You can get the number of bytes from the arrays in a pandas DataFrame by doing

nbytes = sum(block.values.nbytes for block in df.blocks.values())

object dtype arrays store 8 bytes per object (object dtype arrays store a pointer to an opaque PyObject), so if you have strings in your csv you need to take into account that read_csv will turn those into object dtype arrays and adjust your calculations accordingly.

EDIT:

See the numpy scalar types page for more details on the object dtype. Since only a reference is stored you need to take into account the size of the object in the array as well. As that page says, object arrays are somewhat similar to Python list objects.


Answer 5

Yes there is. Pandas will store your data in 2-dimensional numpy ndarray structures, grouping them by dtypes. ndarray is basically a raw C array of data with a small header. So you can estimate its size just by multiplying the size of the dtype it contains by the dimensions of the array.

For example: if you have 1000 rows with 2 np.int32 and 5 np.float64 columns, your DataFrame will have one 2×1000 np.int32 array and one 5×1000 np.float64 array which is:

4bytes*2*1000 + 8bytes*5*1000 = 48000 bytes
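
As a quick check of that arithmetic (a sketch with hypothetical column names):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.zeros(1000, dtype=np.int32),
                   'b': np.zeros(1000, dtype=np.int32),
                   'c': np.zeros(1000, dtype=np.float64),
                   'd': np.zeros(1000, dtype=np.float64),
                   'e': np.zeros(1000, dtype=np.float64),
                   'f': np.zeros(1000, dtype=np.float64),
                   'g': np.zeros(1000, dtype=np.float64)})
print(df.memory_usage(index=False).sum())   # 48000, matching the estimate above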


Answer 6

This, I believe, gives the in-memory size of any object in Python. Internals need to be checked with regard to pandas and numpy.

>>> import sys
#assuming the dataframe to be df 
>>> sys.getsizeof(df) 
59542497

Remove duplicate columns in python pandas

Question: Remove duplicate columns in python pandas

What is the easiest way to remove duplicate columns from a dataframe?

I am reading a text file that has duplicate columns via:

import pandas as pd

df=pd.read_table(fname)

The column names are:

Time, Time Relative, N2, Time, Time Relative, H2, etc...

All the Time and Time Relative columns contain the same data. I want:

Time, Time Relative, N2, H2

All my attempts at dropping, deleting, etc., such as:

df=df.T.drop_duplicates().T

result in uniquely valued index errors:

Reindexing only valid with uniquely valued index objects

Sorry for being a Pandas noob. Any Suggestions would be appreciated.


Additional Details

Pandas version: 0.9.0
Python Version: 2.7.3
Windows 7
(installed via Pythonxy 2.7.3.0)

data file (note: in the real file, columns are separated by tabs, here they are separated by 4 spaces):

Time    Time Relative [s]    N2[%]    Time    Time Relative [s]    H2[ppm]
2/12/2013 9:20:55 AM    6.177    9.99268e+001    2/12/2013 9:20:55 AM    6.177    3.216293e-005    
2/12/2013 9:21:06 AM    17.689    9.99296e+001    2/12/2013 9:21:06 AM    17.689    3.841667e-005    
2/12/2013 9:21:18 AM    29.186    9.992954e+001    2/12/2013 9:21:18 AM    29.186    3.880365e-005    
... etc ...
2/12/2013 2:12:44 PM    17515.269    9.991756+001    2/12/2013 2:12:44 PM    17515.269    2.800279e-005    
2/12/2013 2:12:55 PM    17526.769    9.991754e+001    2/12/2013 2:12:55 PM    17526.769    2.880386e-005
2/12/2013 2:13:07 PM    17538.273    9.991797e+001    2/12/2013 2:13:07 PM    17538.273    3.131447e-005

Answer 0

There’s a one line solution to the problem. This applies if some column names are duplicated and you wish to remove them:

df = df.loc[:,~df.columns.duplicated()]

How it works:

Suppose the columns of the data frame are ['alpha','beta','alpha']

df.columns.duplicated() returns a boolean array: a True or False for each column. If it is False then the column name is unique up to that point, if it is True then the column name is duplicated earlier. For example, using the given example, the returned value would be [False,False,True].

Pandas allows one to index using boolean values whereby it selects only the True values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (ie [True, True, False] = ~[False,False,True])

Finally, df.loc[:,[True,True,False]] selects only the non-duplicated columns using the aforementioned indexing capability.

Note: the above only checks column names, not column values.
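
As a quick, self-contained illustration (my own toy frame, not the asker's data), the one-liner behaves like this:

import pandas as pd

# toy frame with a repeated column name, mimicking the duplicated Time columns
df = pd.DataFrame([[1, 2, 1], [3, 4, 3]], columns=['Time', 'N2', 'Time'])

df = df.loc[:, ~df.columns.duplicated()]
print(df.columns.tolist())   # ['Time', 'N2'] -- only the first occurrence of each name survives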


回答 1

听起来您已经知道唯一的列名。如果是这样,使用 df = df[['Time', 'Time Relative', 'N2']] 就可以了。

如果没有,您的解决方案应该可以工作:

In [101]: vals = np.random.randint(0,20, (4,3))
          vals
Out[101]:
array([[ 3, 13,  0],
       [ 1, 15, 14],
       [14, 19, 14],
       [19,  5,  1]])

In [106]: df = pd.DataFrame(np.hstack([vals, vals]), columns=['Time', 'H1', 'N2', 'Time Relative', 'N2', 'Time'] )
          df
Out[106]:
   Time  H1  N2  Time Relative  N2  Time
0     3  13   0              3  13     0
1     1  15  14              1  15    14
2    14  19  14             14  19    14
3    19   5   1             19   5     1

In [107]: df.T.drop_duplicates().T
Out[107]:
   Time  H1  N2
0     3  13   0
1     1  15  14
2    14  19  14
3    19   5   1

可能是您的数据中有一些特殊之处导致了问题。如果您能提供更多关于数据的细节,我们可以提供更多帮助。

编辑:就像Andy所说的,问题可能出在重复的列标题上。

对于我编写的示例表文件“dummy.csv”:

Time    H1  N2  Time    N2  Time Relative
3   13  13  3   13  0
1   15  15  1   15  14
14  19  19  14  19  14
19  5   5   19  5   1

使用read_table给出唯一的列并正常工作:

In [151]: df2 = pd.read_table('dummy.csv')
          df2
Out[151]:
         Time  H1  N2  Time.1  N2.1  Time Relative
      0     3  13  13       3    13              0
      1     1  15  15       1    15             14
      2    14  19  19      14    19             14
      3    19   5   5      19     5              1
In [152]: df2.T.drop_duplicates().T
Out[152]:
             Time  H1  Time Relative
          0     3  13              0
          1     1  15             14
          2    14  19             14
          3    19   5              1  

如果您的pandas版本不支持这种方式,则可以拼凑一个方案使列名唯一:

In [169]: df2 = pd.read_table('dummy.csv', header=None)
          df2
Out[169]:
              0   1   2     3   4              5
        0  Time  H1  N2  Time  N2  Time Relative
        1     3  13  13     3  13              0
        2     1  15  15     1  15             14
        3    14  19  19    14  19             14
        4    19   5   5    19   5              1
In [171]: from collections import defaultdict
          col_counts = defaultdict(int)
          col_ix = df2.first_valid_index()
In [172]: cols = []
          for col in df2.ix[col_ix]:
              cnt = col_counts[col]
              col_counts[col] += 1
              suf = '_' + str(cnt) if cnt else ''
              cols.append(col + suf)
          cols
Out[172]:
          ['Time', 'H1', 'N2', 'Time_1', 'N2_1', 'Time Relative']
In [174]: df2.columns = cols
          df2 = df2.drop([col_ix])
In [177]: df2
Out[177]:
          Time  H1  N2 Time_1 N2_1 Time Relative
        1    3  13  13      3   13             0
        2    1  15  15      1   15            14
        3   14  19  19     14   19            14
        4   19   5   5     19    5             1
In [178]: df2.T.drop_duplicates().T
Out[178]:
          Time  H1 Time Relative
        1    3  13             0
        2    1  15            14
        3   14  19            14
        4   19   5             1 

It sounds like you already know the unique column names. If that’s the case, then df = df[['Time', 'Time Relative', 'N2']] would work.

If not, your solution should work:

In [101]: vals = np.random.randint(0,20, (4,3))
          vals
Out[101]:
array([[ 3, 13,  0],
       [ 1, 15, 14],
       [14, 19, 14],
       [19,  5,  1]])

In [106]: df = pd.DataFrame(np.hstack([vals, vals]), columns=['Time', 'H1', 'N2', 'Time Relative', 'N2', 'Time'] )
          df
Out[106]:
   Time  H1  N2  Time Relative  N2  Time
0     3  13   0              3  13     0
1     1  15  14              1  15    14
2    14  19  14             14  19    14
3    19   5   1             19   5     1

In [107]: df.T.drop_duplicates().T
Out[107]:
   Time  H1  N2
0     3  13   0
1     1  15  14
2    14  19  14
3    19   5   1

You probably have something specific to your data that’s messing it up. We could give more help if there’s more details you could give us about the data.

Edit: Like Andy said, the problem is probably with the duplicate column titles.

For a sample table file ‘dummy.csv’ I made up:

Time    H1  N2  Time    N2  Time Relative
3   13  13  3   13  0
1   15  15  1   15  14
14  19  19  14  19  14
19  5   5   19  5   1

using read_table gives unique columns and works properly:

In [151]: df2 = pd.read_table('dummy.csv')
          df2
Out[151]:
         Time  H1  N2  Time.1  N2.1  Time Relative
      0     3  13  13       3    13              0
      1     1  15  15       1    15             14
      2    14  19  19      14    19             14
      3    19   5   5      19     5              1
In [152]: df2.T.drop_duplicates().T
Out[152]:
             Time  H1  Time Relative
          0     3  13              0
          1     1  15             14
          2    14  19             14
          3    19   5              1  

If your version doesn’t let you do that, you can hack together a solution to make them unique:

In [169]: df2 = pd.read_table('dummy.csv', header=None)
          df2
Out[169]:
              0   1   2     3   4              5
        0  Time  H1  N2  Time  N2  Time Relative
        1     3  13  13     3  13              0
        2     1  15  15     1  15             14
        3    14  19  19    14  19             14
        4    19   5   5    19   5              1
In [171]: from collections import defaultdict
          col_counts = defaultdict(int)
          col_ix = df2.first_valid_index()
In [172]: cols = []
          for col in df2.ix[col_ix]:
              cnt = col_counts[col]
              col_counts[col] += 1
              suf = '_' + str(cnt) if cnt else ''
              cols.append(col + suf)
          cols
Out[172]:
          ['Time', 'H1', 'N2', 'Time_1', 'N2_1', 'Time Relative']
In [174]: df2.columns = cols
          df2 = df2.drop([col_ix])
In [177]: df2
Out[177]:
          Time  H1  N2 Time_1 N2_1 Time Relative
        1    3  13  13      3   13             0
        2    1  15  15      1   15            14
        3   14  19  19     14   19            14
        4   19   5   5     19    5             1
In [178]: df2.T.drop_duplicates().T
Out[178]:
          Time  H1 Time Relative
        1    3  13             0
        2    1  15            14
        3   14  19            14
        4   19   5             1 
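
As a hedged side note (not part of the original answer): df.ix has been removed in later pandas versions, so a rough modern equivalent of the header-uniquifying hack might look like the sketch below, assuming the same hypothetical 'dummy.csv':

import pandas as pd
from collections import defaultdict

raw = pd.read_table('dummy.csv', header=None)   # read the header row as data, as above
header = raw.iloc[0]                            # first row holds the column names

col_counts = defaultdict(int)
cols = []
for col in header:
    cnt = col_counts[col]
    col_counts[col] += 1
    cols.append(col + ('_' + str(cnt) if cnt else ''))

raw.columns = cols
df2 = raw.iloc[1:].reset_index(drop=True)       # drop the header row, keep the data
df2 = df2.T.drop_duplicates().T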

回答 2

对于大型DataFrame,转置效率很低。这是一个替代方案:

def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        dcols = frame[v].to_dict(orient="list")

        vs = list(dcols.values())
        ks = list(dcols.keys())
        lvs = len(vs)

        for i in range(lvs):
            for j in range(i+1,lvs):
                if vs[i] == vs[j]: 
                    dups.append(ks[i])
                    break

    return dups       

像这样使用它:

dups = duplicate_columns(frame)
frame = frame.drop(dups, axis=1)

编辑

一种内存占用更低的版本,将NaN当作与其他值一样处理:

from pandas.core.common import array_equivalent

def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []

    for t, v in groups.items():

        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)

        for i in range(lcs):
            ia = vs.iloc[:,i].values
            for j in range(i+1, lcs):
                ja = vs.iloc[:,j].values
                if array_equivalent(ia, ja):
                    dups.append(cs[i])
                    break

    return dups

Transposing is inefficient for large DataFrames. Here is an alternative:

def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        dcols = frame[v].to_dict(orient="list")

        vs = list(dcols.values())
        ks = list(dcols.keys())
        lvs = len(vs)

        for i in range(lvs):
            for j in range(i+1,lvs):
                if vs[i] == vs[j]: 
                    dups.append(ks[i])
                    break

    return dups       

Use it like this:

dups = duplicate_columns(frame)
frame = frame.drop(dups, axis=1)

Edit

A memory efficient version that treats nans like any other value:

from pandas.core.common import array_equivalent

def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []

    for t, v in groups.items():

        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)

        for i in range(lcs):
            ia = vs.iloc[:,i].values
            for j in range(i+1, lcs):
                ja = vs.iloc[:,j].values
                if array_equivalent(ia, ja):
                    dups.append(cs[i])
                    break

    return dups
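
A further alternative (my own sketch, not from the answer): pandas.util.hash_pandas_object can hash each column once, so full comparisons are only needed for columns whose hashes collide; like array_equivalent, Series.equals treats NaNs in matching positions as equal. This assumes unique column names (e.g. Time, Time.1 after read_table has mangled them).

import pandas as pd
from pandas.util import hash_pandas_object

def duplicate_columns_by_hash(frame):
    # group columns by a hash of their values, then confirm equality inside each group
    groups = {}
    for col in frame.columns:
        key = tuple(hash_pandas_object(frame[col], index=False))
        groups.setdefault(key, []).append(col)

    dups = []
    for cols in groups.values():
        first = cols[0]
        dups.extend(c for c in cols[1:] if frame[c].equals(frame[first]))
    return dups

# usage mirrors the answer above:
# frame = frame.drop(duplicate_columns_by_hash(frame), axis=1)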

回答 3

如果我没有记错的话,下面的做法可以解决问题,既没有转置方案的内存问题,行数也比@kalu的函数更少,并且保留同名列中的第一列。

Cols = list(df.columns)
for i,item in enumerate(df.columns):
    if item in df.columns[:i]: Cols[i] = "toDROP"
df.columns = Cols
df = df.drop("toDROP", axis=1)

If I’m not mistaken, the following does what was asked without the memory problems of the transpose solution and with fewer lines than @kalu ‘s function, keeping the first of any similarly named columns.

Cols = list(df.columns)
for i,item in enumerate(df.columns):
    if item in df.columns[:i]: Cols[i] = "toDROP"
df.columns = Cols
df = df.drop("toDROP", axis=1)

回答 4

看来您的思路是正确的。这就是您要找的单行解决方案:

df.reset_index().T.drop_duplicates().T

但是,由于没有示例数据帧会产生引用的错误消息Reindexing only valid with uniquely valued index objects,因此很难确切说明解决问题的方法。如果恢复原始索引对您很重要,请执行以下操作:

original_index = df.index.names
df.reset_index().T.drop_duplicates().reset_index(original_index).T

It looks like you were on the right path. Here is the one-liner you were looking for:

df.reset_index().T.drop_duplicates().T

But since there is no example data frame that produces the referenced error message Reindexing only valid with uniquely valued index objects, it is tough to say exactly what would solve the problem. If restoring the original index is important to you, do this:

original_index = df.index.names
df.reset_index().T.drop_duplicates().reset_index(original_index).T

回答 5

第一步:先读取第一行(即所有列名),并删除其中重复的列名。

第二步:然后只读取这些列。

cols = pd.read_csv("file.csv", header=None, nrows=1).iloc[0].drop_duplicates()
df = pd.read_csv("file.csv", usecols=cols)

First step: read the first row (i.e. all the column names) and drop the duplicates.

Second step: then read only those columns.

cols = pd.read_csv("file.csv", header=None, nrows=1).iloc[0].drop_duplicates()
df = pd.read_csv("file.csv", usecols=cols)

回答 6

我也遇到了这个问题,第一个答案提供的单行代码效果很好。但是,我的麻烦之处在于该列的第二个副本包含了所有数据,而第一个副本没有。

解决方案是通过切换取反运算符把原数据框拆分成两个数据框。有了这两个数据框之后,我用lsuffix执行了一次join。这样,我就可以引用并删除那个没有数据的列。

-E

I ran into this problem where the one liner provided by the first answer worked well. However, I had the extra complication where the second copy of the column had all of the data. The first copy did not.

The solution was to create two data frames by splitting the one data frame by toggling the negation operator. Once I had the two data frames, I ran a join statement using the lsuffix. This way, I could then reference and delete the column without the data.

– E


回答 7

下面的方法将识别重复列,以查看最初构建数据框时出了什么问题。

dupes = pd.DataFrame(df.columns)
dupes[dupes.duplicated()]

The approach below will identify duplicate columns so you can review what went wrong when the dataframe was originally built.

dupes = pd.DataFrame(df.columns)
dupes[dupes.duplicated()]
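
A tiny made-up example of what that returns (column names are illustrative only):

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['Time', 'N2', 'Time'])
dupes = pd.DataFrame(df.columns)
print(dupes[dupes.duplicated()])
#       0
# 2  Time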

回答 8

通过其值删除重复列的快速简便方法:

df = df.T.drop_duplicates().T

更多信息:Pandas DataFrame drop_duplicates manual

Fast and easy way to drop the duplicated columns by their values:

df = df.T.drop_duplicates().T

More info: Pandas DataFrame drop_duplicates manual.


如何使用python中的pandas获取所有重复项的列表?

问题:如何使用python中的pandas获取所有重复项的列表?

我有一个项目列表,其中可能存在一些导出问题。我想获得重复项的列表,以便手动比较它们。当我尝试使用pandas的duplicated方法时,它仅返回第一个重复项。有没有办法获取所有重复项,而不仅仅是第一个?

我的数据集的一个小部分看起来像这样:

ID,ENROLLMENT_DATE,TRAINER_MANAGING,TRAINER_OPERATOR,FIRST_VISIT_DATE
1536D,12-Feb-12,"06DA1B3-Lebanon NH",,15-Feb-12
F15D,18-May-12,"06405B2-Lebanon NH",,25-Jul-12
8096,8-Aug-12,"0643D38-Hanover NH","0643D38-Hanover NH",25-Jun-12
A036,1-Apr-12,"06CB8CF-Hanover NH","06CB8CF-Hanover NH",9-Aug-12
8944,19-Feb-12,"06D26AD-Hanover NH",,4-Feb-12
1004E,8-Jun-12,"06388B2-Lebanon NH",,24-Dec-11
11795,3-Jul-12,"0649597-White River VT","0649597-White River VT",30-Mar-12
30D7,11-Nov-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",30-Nov-11
3AE2,21-Feb-12,"06405B2-Lebanon NH",,26-Oct-12
B0FE,17-Feb-12,"06D1B9D-Hartland VT",,16-Feb-12
127A1,11-Dec-11,"064456E-Hanover NH","064456E-Hanover NH",11-Nov-12
161FF,20-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",3-Jul-12
A036,30-Nov-11,"063B208-Randolph VT","063B208-Randolph VT",
475B,25-Sep-12,"06D26AD-Hanover NH",,5-Nov-12
151A3,7-Mar-12,"06388B2-Lebanon NH",,16-Nov-12
CA62,3-Jan-12,,,
D31B,18-Dec-11,"06405B2-Lebanon NH",,9-Jan-12
20F5,8-Jul-12,"0669C50-Randolph VT",,3-Feb-12
8096,19-Dec-11,"0649597-White River VT","0649597-White River VT",9-Apr-12
14E48,1-Aug-12,"06D3206-Hanover NH",,
177F8,20-Aug-12,"063B208-Randolph VT","063B208-Randolph VT",5-May-12
553E,11-Oct-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",8-Mar-12
12D5F,18-Jul-12,"0649597-White River VT","0649597-White River VT",2-Nov-12
C6DC,13-Apr-12,"06388B2-Lebanon NH",,
11795,27-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",19-Jun-12
17B43,11-Aug-12,,,22-Oct-12
A036,11-Aug-12,"06D3206-Hanover NH",,19-Jun-12

我的代码当前如下所示:

df_bigdata_duplicates = df_bigdata[df_bigdata.duplicated(cols='ID')]

数据中有几条重复的记录。但是,当我使用上面的代码时,我只得到第一条。在API参考中,我看到了如何获取最后一条,但我希望得到所有重复项,这样就可以目视检查它们,看看差异是怎么来的。因此,在此示例中,我想获得全部三个A036条目、两个11795条目以及任何其他重复的条目,而不是只有第一个。任何帮助深表感谢。

I have a list of items that likely has some export issues. I would like to get a list of the duplicate items so I can manually compare them. When I try to use pandas duplicated method, it only returns the first duplicate. Is there a a way to get all of the duplicates and not just the first one?

A small subsection of my dataset looks like this:

ID,ENROLLMENT_DATE,TRAINER_MANAGING,TRAINER_OPERATOR,FIRST_VISIT_DATE
1536D,12-Feb-12,"06DA1B3-Lebanon NH",,15-Feb-12
F15D,18-May-12,"06405B2-Lebanon NH",,25-Jul-12
8096,8-Aug-12,"0643D38-Hanover NH","0643D38-Hanover NH",25-Jun-12
A036,1-Apr-12,"06CB8CF-Hanover NH","06CB8CF-Hanover NH",9-Aug-12
8944,19-Feb-12,"06D26AD-Hanover NH",,4-Feb-12
1004E,8-Jun-12,"06388B2-Lebanon NH",,24-Dec-11
11795,3-Jul-12,"0649597-White River VT","0649597-White River VT",30-Mar-12
30D7,11-Nov-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",30-Nov-11
3AE2,21-Feb-12,"06405B2-Lebanon NH",,26-Oct-12
B0FE,17-Feb-12,"06D1B9D-Hartland VT",,16-Feb-12
127A1,11-Dec-11,"064456E-Hanover NH","064456E-Hanover NH",11-Nov-12
161FF,20-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",3-Jul-12
A036,30-Nov-11,"063B208-Randolph VT","063B208-Randolph VT",
475B,25-Sep-12,"06D26AD-Hanover NH",,5-Nov-12
151A3,7-Mar-12,"06388B2-Lebanon NH",,16-Nov-12
CA62,3-Jan-12,,,
D31B,18-Dec-11,"06405B2-Lebanon NH",,9-Jan-12
20F5,8-Jul-12,"0669C50-Randolph VT",,3-Feb-12
8096,19-Dec-11,"0649597-White River VT","0649597-White River VT",9-Apr-12
14E48,1-Aug-12,"06D3206-Hanover NH",,
177F8,20-Aug-12,"063B208-Randolph VT","063B208-Randolph VT",5-May-12
553E,11-Oct-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",8-Mar-12
12D5F,18-Jul-12,"0649597-White River VT","0649597-White River VT",2-Nov-12
C6DC,13-Apr-12,"06388B2-Lebanon NH",,
11795,27-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",19-Jun-12
17B43,11-Aug-12,,,22-Oct-12
A036,11-Aug-12,"06D3206-Hanover NH",,19-Jun-12

My code looks like this currently:

df_bigdata_duplicates = df_bigdata[df_bigdata.duplicated(cols='ID')]

There are a couple of duplicate items. But, when I use the above code, I only get the first item. In the API reference, I see how I can get the last item, but I would like to have all of them so I can visually inspect them to see why I am getting the discrepancy. So, in this example I would like to get all three A036 entries and both 11795 entries and any other duplicated entries, instead of just the first one. Any help is most appreciated.


回答 0

方法1:打印所有ID为重复ID之一的行:

>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort("ID")
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12

但是我想不出一个好办法来避免多次重复这些id。我更喜欢方法2:对ID使用groupby。

>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12

Method #1: print all rows where the ID is one of the IDs in duplicated:

>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort("ID")
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12

but I couldn’t think of a nice way to prevent repeating ids so many times. I prefer method #2: groupby on the ID.

>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12
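
For what it's worth (my addition, not from the answer): groupby.filter expresses the same idea without the concat, assuming the same df read from "dup.csv":

# keep only the groups that have more than one row
df.groupby("ID").filter(lambda g: len(g) > 1)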

回答 1

在Pandas 0.17版本中,您可以在duplicated函数中设置keep=False,以获取所有重复项。

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(['a','b','c','d','a','b'])

In [3]: df
Out[3]: 
       0
    0  a
    1  b
    2  c
    3  d
    4  a
    5  b

In [4]: df[df.duplicated(keep=False)]
Out[4]: 
       0
    0  a
    1  b
    4  a
    5  b

With Pandas version 0.17, you can set ‘keep = False’ in the duplicated function to get all the duplicate items.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(['a','b','c','d','a','b'])

In [3]: df
Out[3]: 
       0
    0  a
    1  b
    2  c
    3  d
    4  a
    5  b

In [4]: df[df.duplicated(keep=False)]
Out[4]: 
       0
    0  a
    1  b
    4  a
    5  b

回答 2

df[df.duplicated(['ID'], keep=False)]

它将所有重复的行返回给您。

根据文档:

keep:{‘first’,’last’,False},默认为’first’

  • first:除第一次出现外,将其余重复项标记为True。
  • last:除最后一次出现外,将其余重复项标记为True。
  • False:将所有重复项标记为True。

df[df.duplicated(['ID'], keep=False)]

it’ll return all duplicated rows back to you.

According to documentation:

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Mark duplicates as True except for the first occurrence.
  • last : Mark duplicates as True except for the last occurrence.
  • False : Mark all duplicates as True.

回答 3

由于我无法发表评论,因此将其发布为单独的答案

要基于多个列查找重复项,请像下面这样列出每个列名,它将返回所有重复的行:

df[df[['product_uid', 'product_title', 'user']].duplicated() == True]

As I am unable to comment, hence posting as a separate answer

To find duplicates on the basis of more than one column, list every column name as below, and it will return all the duplicated rows:

df[df[['product_uid', 'product_title', 'user']].duplicated() == True]

回答 4

df[df['ID'].duplicated() == True]

这对我有用

df[df['ID'].duplicated() == True]

This worked for me


回答 5

将pandas的duplicated方法的take_last参数分别设置为True和False,再对两个结果做按元素的逻辑或运算,就可以从数据框中得到包含所有重复项的集合。

df_bigdata_duplicates = df_bigdata[
    df_bigdata.duplicated(cols='ID', take_last=False) |
    df_bigdata.duplicated(cols='ID', take_last=True)
]

Using an element-wise logical or and setting the take_last argument of the pandas duplicated method to both True and False you can obtain a set from your dataframe that includes all of the duplicates.

df_bigdata_duplicates = df_bigdata[
    df_bigdata.duplicated(cols='ID', take_last=False) |
    df_bigdata.duplicated(cols='ID', take_last=True)
]
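
For reference (my note, not from the answer): in pandas 0.17+ the cols and take_last arguments were renamed to subset and keep, and keep=False marks every duplicate, so the same result is a single call:

df_bigdata_duplicates = df_bigdata[df_bigdata.duplicated(subset='ID', keep=False)]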

回答 6

这可能不是解决问题的方法,而是举例说明:

import pandas as pd

df = pd.DataFrame({
    'A': [1,1,3,4],
    'B': [2,2,5,6],
    'C': [3,4,7,6],
})

print(df)
df.duplicated(keep=False)
df.duplicated(['A','B'], keep=False)

输出:

   A  B  C
0  1  2  3
1  1  2  4
2  3  5  7
3  4  6  6

0    False
1    False
2    False
3    False
dtype: bool

0     True
1     True
2    False
3    False
dtype: bool

This may not be a solution to the question, but to illustrate examples:

import pandas as pd

df = pd.DataFrame({
    'A': [1,1,3,4],
    'B': [2,2,5,6],
    'C': [3,4,7,6],
})

print(df)
df.duplicated(keep=False)
df.duplicated(['A','B'], keep=False)

The outputs:

   A  B  C
0  1  2  3
1  1  2  4
2  3  5  7
3  4  6  6

0    False
1    False
2    False
3    False
dtype: bool

0     True
1     True
2    False
3    False
dtype: bool

回答 7

sort("ID")现在似乎无法正常工作,似乎已按照sort doc弃用,因此请sort_values("ID")改为使用重复过滤器进行排序,如下所示:

df[df.ID.duplicated(keep=False)].sort_values("ID")

sort("ID") does not seem to be working now, seems deprecated as per sort doc, so use sort_values("ID") instead to sort after duplicate filter, as following:

df[df.ID.duplicated(keep=False)].sort_values("ID")

回答 8

对于我的数据库,在对该列排序之前,duplicated(keep=False)不起作用。

data.sort_values(by=['Order ID'], inplace=True)
df = data[data['Order ID'].duplicated(keep=False)]

For my database duplicated(keep=False) did not work until the column was sorted.

data.sort_values(by=['Order ID'], inplace=True)
df = data[data['Order ID'].duplicated(keep=False)]

回答 9

df[df.duplicated(['ID'])==True].sort_values('ID')

df[df.duplicated(['ID'])==True].sort_values('ID')


如何使用Pandas创建随机整数的DataFrame?

问题:如何使用Pandas创建随机整数的DataFrame?

我知道如果我使用randn

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

给了我我想要的东西,但是带有正态分布的元素。但是,如果我只想要随机整数怎么办?

randint需要提供一个范围,但不像randn那样接受数组参数。那么,我该如何生成某个范围内的随机整数呢?

I know that if I use randn,

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

gives me what I am looking for, but with elements from a normal distribution. But what if I just wanted random integers?

randint works by providing a range, but not an array like randn does. So how do I do this with random integers between some range?


回答 0

numpy.random.randint接受第三个参数(size),您可以在其中指定输出数组的大小。您可以使用它来创建DataFrame

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

此处,np.random.randint(0,100,size=(100, 4))会创建一个大小为(100,4)的输出数组,其中的随机整数元素位于[0,100)区间内。


演示-

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

生成:

     A   B   C   D
0   45  88  44  92
1   62  34   2  86
2   85  65  11  31
3   74  43  42  56
4   90  38  34  93
5    0  94  45  10
6   58  23  23  60
..  ..  ..  ..  ..

numpy.random.randint accepts a third argument (size) , in which you can specify the size of the output array. You can use this to create your DataFrame

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

Here – np.random.randint(0,100,size=(100, 4)) – creates an output array of size (100,4) with random integer elements between [0,100).


Demo –

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

which produces:

     A   B   C   D
0   45  88  44  92
1   62  34   2  86
2   85  65  11  31
3   74  43  42  56
4   90  38  34  93
5    0  94  45  10
6   58  23  23  60
..  ..  ..  ..  ..

回答 1

如今,NumPy推荐使用numpy.random.Generator.integers来创建随机整数。(文档)

import numpy as np
import pandas as pd

rng = np.random.default_rng()
df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list('ABCD'))
df
----------------------
      A    B    C    D
 0   58   96   82   24
 1   21    3   35   36
 2   67   79   22   78
 3   81   65   77   94
 4   73    6   70   96
... ...  ...  ...  ...
95   76   32   28   51
96   33   68   54   77
97   76   43   57   43
98   34   64   12   57
99   81   77   32   50
100 rows × 4 columns

The recommended way to create random integers with NumPy these days is to use numpy.random.Generator.integers. (documentation)

import numpy as np
import pandas as pd

rng = np.random.default_rng()
df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list('ABCD'))
df
----------------------
      A    B    C    D
 0   58   96   82   24
 1   21    3   35   36
 2   67   79   22   78
 3   81   65   77   94
 4   73    6   70   96
... ...  ...  ...  ...
95   76   32   28   51
96   33   68   54   77
97   76   43   57   43
98   34   64   12   57
99   81   77   32   50
100 rows × 4 columns
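
One small addition of my own: passing a seed to default_rng makes the random frame reproducible, which is handy for examples and tests:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # fixed seed -> identical integers on every run
df = pd.DataFrame(rng.integers(0, 100, size=(5, 4)), columns=list('ABCD'))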