熊猫的笛卡尔积

问题:熊猫的笛卡尔积

我有两个熊猫数据框:

from pandas import DataFrame
df1 = DataFrame({'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'col3':[5,6]})     

获得笛卡尔积的最佳实践是什么(当然不用像我这样明确地写它)?

#df1, df2 cartesian product
df_cartesian = DataFrame({'col1':[1,2,1,2],'col2':[3,4,3,4],'col3':[5,5,6,6]})

I have two pandas dataframes:

from pandas import DataFrame
df1 = DataFrame({'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'col3':[5,6]})     

What is the best practice to get their cartesian product (of course without writing it explicitly like me)?

#df1, df2 cartesian product
df_cartesian = DataFrame({'col1':[1,2,1,2],'col2':[3,4,3,4],'col3':[5,5,6,6]})

回答 0

如果每行都有一个重复的键,则可以使用merge生成笛卡尔乘积(就像在SQL中一样)。

from pandas import DataFrame, merge
df1 = DataFrame({'key':[1,1], 'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'key':[1,1], 'col3':[5,6]})

merge(df1, df2,on='key')[['col1', 'col2', 'col3']]

输出:

   col1  col2  col3
0     1     3     5
1     1     3     6
2     2     4     5
3     2     4     6

有关文档,请参见此处:http : //pandas.pydata.org/pandas-docs/stable/merging.html#brief-primer-on-merge-methods-relational-algebra

If you have a key that is repeated for each row, then you can produce a cartesian product using merge (like you would in SQL).

from pandas import DataFrame, merge
df1 = DataFrame({'key':[1,1], 'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'key':[1,1], 'col3':[5,6]})

merge(df1, df2,on='key')[['col1', 'col2', 'col3']]

Output:

   col1  col2  col3
0     1     3     5
1     1     3     6
2     2     4     5
3     2     4     6

See here for the documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html#brief-primer-on-merge-methods-relational-algebra


回答 1

使用pd.MultiIndex.from_product在人少的数据帧的索引,然后复位它的索引,就大功告成了。

a = [1, 2, 3]
b = ["a", "b", "c"]

index = pd.MultiIndex.from_product([a, b], names = ["a", "b"])

pd.DataFrame(index = index).reset_index()

出:

   a  b
0  1  a
1  1  b
2  1  c
3  2  a
4  2  b
5  2  c
6  3  a
7  3  b
8  3  c

Use pd.MultiIndex.from_product as an index in an otherwise empty dataframe, then reset its index, and you’re done.

a = [1, 2, 3]
b = ["a", "b", "c"]

index = pd.MultiIndex.from_product([a, b], names = ["a", "b"])

pd.DataFrame(index = index).reset_index()

out:

   a  b
0  1  a
1  1  b
2  1  c
3  2  a
4  2  b
5  2  c
6  3  a
7  3  b
8  3  c

回答 2

这不会赢得一场代码高尔夫比赛,它会借鉴先前的答案-但会清楚地显示出密钥的添加方式以及联接的工作方式。这将从列表中创建2个新数据框,然后添加用于执行笛卡尔乘积的键。

我的用例是,我需要在列表中每周列出所有商店ID的列表。因此,我创建了一个我想拥有的所有星期的列表,然后创建了一个我想用来映射它们的所有商店ID的列表。

我选择了合并,但在语义上与此设置中的内部相同。您可以在关于合并的文档中看到这一点,该文档指出如果两个表中的键组合均出现多次,则该操作将执行笛卡尔乘积。

days = pd.DataFrame({'date':list_of_days})
stores = pd.DataFrame({'store_id':list_of_stores})
stores['key'] = 0
days['key'] = 0
days_and_stores = days.merge(stores, how='left', on = 'key')
days_and_stores.drop('key',1, inplace=True)

This won’t win a code golf competition, and borrows from the previous answers – but clearly shows how the key is added, and how the join works. This creates 2 new data frames from lists, then adds the key to do the cartesian product on.

My use case was that I needed a list of all store IDs on for each week in my list. So, I created a list of all the weeks I wanted to have, then a list of all the store IDs I wanted to map them against.

The merge I chose left, but would be semantically the same as inner in this setup. You can see this in the documentation on merging, which states it does a Cartesian product if key combination appears more than once in both tables – which is what we set up.

days = pd.DataFrame({'date':list_of_days})
stores = pd.DataFrame({'store_id':list_of_stores})
stores['key'] = 0
days['key'] = 0
days_and_stores = days.merge(stores, how='left', on = 'key')
days_and_stores.drop('key',1, inplace=True)

回答 3

这需要最少的代码。创建一个通用的“键”以笛卡尔合并两者:

df1['key'] = 0
df2['key'] = 0

df_cartesian = df1.merge(df2, how='outer')

Minimal code needed for this one. Create a common ‘key’ to cartesian merge the two:

df1['key'] = 0
df2['key'] = 0

df_cartesian = df1.merge(df2, how='outer')

回答 4

使用方法链接:

product = (
    df1.assign(key=1)
    .merge(df2.assign(key=1), on="key")
    .drop("key", axis=1)
)

With method chaining:

product = (
    df1.assign(key=1)
    .merge(df2.assign(key=1), on="key")
    .drop("key", axis=1)
)

回答 5

另一种选择是,可以依靠itertools:提供的笛卡尔乘积itertools.product,避免创建临时键或修改索引:

import numpy as np 
import pandas as pd 
import itertools

def cartesian(df1, df2):
    rows = itertools.product(df1.iterrows(), df2.iterrows())

    df = pd.DataFrame(left.append(right) for (_, left), (_, right) in rows)
    return df.reset_index(drop=True)

快速测试:

In [46]: a = pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])

In [47]: b = pd.DataFrame(np.random.rand(5, 3), columns=["d", "e", "f"])    

In [48]: cartesian(a,b)
Out[48]:
           a         b         c         d         e         f
0   0.436480  0.068491  0.260292  0.991311  0.064167  0.715142
1   0.436480  0.068491  0.260292  0.101777  0.840464  0.760616
2   0.436480  0.068491  0.260292  0.655391  0.289537  0.391893
3   0.436480  0.068491  0.260292  0.383729  0.061811  0.773627
4   0.436480  0.068491  0.260292  0.575711  0.995151  0.804567
5   0.469578  0.052932  0.633394  0.991311  0.064167  0.715142
6   0.469578  0.052932  0.633394  0.101777  0.840464  0.760616
7   0.469578  0.052932  0.633394  0.655391  0.289537  0.391893
8   0.469578  0.052932  0.633394  0.383729  0.061811  0.773627
9   0.469578  0.052932  0.633394  0.575711  0.995151  0.804567
10  0.466813  0.224062  0.218994  0.991311  0.064167  0.715142
11  0.466813  0.224062  0.218994  0.101777  0.840464  0.760616
12  0.466813  0.224062  0.218994  0.655391  0.289537  0.391893
13  0.466813  0.224062  0.218994  0.383729  0.061811  0.773627
14  0.466813  0.224062  0.218994  0.575711  0.995151  0.804567
15  0.831365  0.273890  0.130410  0.991311  0.064167  0.715142
16  0.831365  0.273890  0.130410  0.101777  0.840464  0.760616
17  0.831365  0.273890  0.130410  0.655391  0.289537  0.391893
18  0.831365  0.273890  0.130410  0.383729  0.061811  0.773627
19  0.831365  0.273890  0.130410  0.575711  0.995151  0.804567
20  0.447640  0.848283  0.627224  0.991311  0.064167  0.715142
21  0.447640  0.848283  0.627224  0.101777  0.840464  0.760616
22  0.447640  0.848283  0.627224  0.655391  0.289537  0.391893
23  0.447640  0.848283  0.627224  0.383729  0.061811  0.773627
24  0.447640  0.848283  0.627224  0.575711  0.995151  0.804567

As an alternative, one can rely on the cartesian product provided by itertools: itertools.product, which avoids creating a temporary key or modifying the index:

import numpy as np 
import pandas as pd 
import itertools

def cartesian(df1, df2):
    rows = itertools.product(df1.iterrows(), df2.iterrows())

    df = pd.DataFrame(left.append(right) for (_, left), (_, right) in rows)
    return df.reset_index(drop=True)

Quick test:

In [46]: a = pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])

In [47]: b = pd.DataFrame(np.random.rand(5, 3), columns=["d", "e", "f"])    

In [48]: cartesian(a,b)
Out[48]:
           a         b         c         d         e         f
0   0.436480  0.068491  0.260292  0.991311  0.064167  0.715142
1   0.436480  0.068491  0.260292  0.101777  0.840464  0.760616
2   0.436480  0.068491  0.260292  0.655391  0.289537  0.391893
3   0.436480  0.068491  0.260292  0.383729  0.061811  0.773627
4   0.436480  0.068491  0.260292  0.575711  0.995151  0.804567
5   0.469578  0.052932  0.633394  0.991311  0.064167  0.715142
6   0.469578  0.052932  0.633394  0.101777  0.840464  0.760616
7   0.469578  0.052932  0.633394  0.655391  0.289537  0.391893
8   0.469578  0.052932  0.633394  0.383729  0.061811  0.773627
9   0.469578  0.052932  0.633394  0.575711  0.995151  0.804567
10  0.466813  0.224062  0.218994  0.991311  0.064167  0.715142
11  0.466813  0.224062  0.218994  0.101777  0.840464  0.760616
12  0.466813  0.224062  0.218994  0.655391  0.289537  0.391893
13  0.466813  0.224062  0.218994  0.383729  0.061811  0.773627
14  0.466813  0.224062  0.218994  0.575711  0.995151  0.804567
15  0.831365  0.273890  0.130410  0.991311  0.064167  0.715142
16  0.831365  0.273890  0.130410  0.101777  0.840464  0.760616
17  0.831365  0.273890  0.130410  0.655391  0.289537  0.391893
18  0.831365  0.273890  0.130410  0.383729  0.061811  0.773627
19  0.831365  0.273890  0.130410  0.575711  0.995151  0.804567
20  0.447640  0.848283  0.627224  0.991311  0.064167  0.715142
21  0.447640  0.848283  0.627224  0.101777  0.840464  0.760616
22  0.447640  0.848283  0.627224  0.655391  0.289537  0.391893
23  0.447640  0.848283  0.627224  0.383729  0.061811  0.773627
24  0.447640  0.848283  0.627224  0.575711  0.995151  0.804567

回答 6

如果没有重叠的列,不想添加一列,并且可以丢弃数据帧的索引,则可能会更容易:

df1.index[:] = df2.index[:] = 0
df_cartesian = df1.join(df2, how='outer')
df_cartesian.index[:] = range(len(df_cartesian))

If you have no overlapping columns, don’t want to add one, and the indices of the data frames can be discarded, this may be easier:

df1.index[:] = df2.index[:] = 0
df_cartesian = df1.join(df2, how='outer')
df_cartesian.index[:] = range(len(df_cartesian))

回答 7

这是一个帮助函数,用于执行带有两个数据帧的简单笛卡尔乘积。内部逻辑使用内部键进行处理,并避免从任一侧弄乱任何碰巧被命名为“键”的列。

import pandas as pd

def cartesian(df1, df2):
    """Determine Cartesian product of two data frames."""
    key = 'key'
    while key in df1.columns or key in df2.columns:
        key = '_' + key
    key_d = {key: 0}
    return pd.merge(
        df1.assign(**key_d), df2.assign(**key_d), on=key).drop(key, axis=1)

# Two data frames, where the first happens to have a 'key' column
df1 = pd.DataFrame({'number':[1, 2], 'key':[3, 4]})
df2 = pd.DataFrame({'digit': [5, 6]})
cartesian(df1, df2)

显示:

   number  key  digit
0       1    3      5
1       1    3      6
2       2    4      5
3       2    4      6

Here is a helper function to perform a simple Cartesian product with two data frames. The internal logic handles using an internal key, and avoids mangling any columns that happen to be named “key” from either side.

import pandas as pd

def cartesian(df1, df2):
    """Determine Cartesian product of two data frames."""
    key = 'key'
    while key in df1.columns or key in df2.columns:
        key = '_' + key
    key_d = {key: 0}
    return pd.merge(
        df1.assign(**key_d), df2.assign(**key_d), on=key).drop(key, axis=1)

# Two data frames, where the first happens to have a 'key' column
df1 = pd.DataFrame({'number':[1, 2], 'key':[3, 4]})
df2 = pd.DataFrame({'digit': [5, 6]})
cartesian(df1, df2)

shows:

   number  key  digit
0       1    3      5
1       1    3      6
2       2    4      5
3       2    4      6

回答 8

你可以采取的笛卡尔积启动df1.col1df2.col3,然后合并回df1得到col2

这是一个通用的笛卡尔乘积函数,它采用列表字典:

def cartesian_product(d):
    index = pd.MultiIndex.from_product(d.values(), names=d.keys())
    return pd.DataFrame(index=index).reset_index()

申请为:

res = cartesian_product({'col1': df1.col1, 'col3': df2.col3})
pd.merge(res, df1, on='col1')
#  col1 col3 col2
# 0   1    5    3
# 1   1    6    3
# 2   2    5    4
# 3   2    6    4

You could start by taking the Cartesian product of df1.col1 and df2.col3, then merge back to df1 to get col2.

Here’s a general Cartesian product function which takes a dictionary of lists:

def cartesian_product(d):
    index = pd.MultiIndex.from_product(d.values(), names=d.keys())
    return pd.DataFrame(index=index).reset_index()

Apply as:

res = cartesian_product({'col1': df1.col1, 'col3': df2.col3})
pd.merge(res, df1, on='col1')
#  col1 col3 col2
# 0   1    5    3
# 1   1    6    3
# 2   2    5    4
# 3   2    6    4

回答 9

您可以使用numpy,因为它可能更快。假设您有两个系列,如下所示:

s1 = pd.Series(np.random.randn(100,))
s2 = pd.Series(np.random.randn(100,))

您只需要,

pd.DataFrame(
    s1[:, None] @ s2[None, :], 
    index = s1.index, columns = s2.index
)

You can use numpy as it could be faster. Suppose you have two series as follows,

s1 = pd.Series(np.random.randn(100,))
s2 = pd.Series(np.random.randn(100,))

You just need,

pd.DataFrame(
    s1[:, None] @ s2[None, :], 
    index = s1.index, columns = s2.index
)

回答 10

我发现使用pandas MultiIndex是工作的最佳工具。如果您具有列表列表lists_list,请调用pd.MultiIndex.from_product(lists_list)并遍历结果(或在DataFrame索引中使用它)。

I find using pandas MultiIndex to be the best tool for the job. If you have a list of lists lists_list, call pd.MultiIndex.from_product(lists_list) and iterate over the result (or use it in DataFrame index).