Question: Shuffle DataFrame rows
I have the following DataFrame:
Col1 Col2 Col3 Type
0 1 2 3 1
1 4 5 6 1
...
20 7 8 9 2
21 10 11 12 2
...
45 13 14 15 3
46 16 17 18 3
...
The DataFrame is read from a csv file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.
I would like to shuffle the order of the DataFrame's rows, so that all Types are mixed. A possible result could be:
Col1 Col2 Col3 Type
0 7 8 9 2
1 13 14 15 3
...
20 1 2 3 1
21 10 11 12 2
...
45 4 5 6 1
46 16 17 18 3
...
How can I achieve this?
Answer 0
The idiomatic way to do this with pandas is to use the .sample method of your dataframe to sample all rows without replacement:
df.sample(frac=1)
The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).
Note:
If you wish to shuffle your dataframe and reset the index, you can assign the result back to the same name, e.g.
df = df.sample(frac=1).reset_index(drop=True)
Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
Follow-up note: The operation above is not in-place in the strict sense. It builds a new object and rebinds the name df to it, so id(df_old) is not the same as id(df_new). Even so, the profile below suggests that python/pandas does not end up holding a second full-size allocation for the shuffled object: the reassignment barely changes the memory footprint of the process. To see this for yourself, you could run a simple memory profiler:
$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
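As a minimal sketch of the two calls in this answer, assuming a small made-up frame (the values and the random_state=0 seed are illustrative and not taken from the question):

```python
import pandas as pd

# Hypothetical stand-in for the frame in the question.
df = pd.DataFrame({
    "Col1": [1, 4, 7, 10, 13, 16],
    "Col2": [2, 5, 8, 11, 14, 17],
    "Col3": [3, 6, 9, 12, 15, 18],
    "Type": [1, 1, 2, 2, 3, 3],
})

# frac=1 samples every row exactly once (no replacement), i.e. a shuffle.
# random_state is optional; it is fixed here only to make the run repeatable.
shuffled = df.sample(frac=1, random_state=0)

# The old index labels travel with their rows; drop=True discards them
# and renumbers the rows 0..n-1.
renumbered = shuffled.reset_index(drop=True)

print(renumbered)
```

Sorting shuffled by its index gives back the original frame, which is a quick way to confirm that no row was lost or duplicated.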
Answer 1
You can simply use sklearn (scikit-learn) for this:
from sklearn.utils import shuffle
df = shuffle(df)
Answer 2
You can shuffle the rows of a dataframe by indexing with a shuffled index. For this, you can e.g. use np.random.permutation (np.random.choice with replace=False is also a possibility):
In [12]: df = pd.read_csv(StringIO(s), sep=r"\s+")
In [13]: df
Out[13]:
Col1 Col2 Col3 Type
0 1 2 3 1
1 4 5 6 1
20 7 8 9 2
21 10 11 12 2
45 13 14 15 3
46 16 17 18 3
In [14]: df.iloc[np.random.permutation(len(df))]
Out[14]:
Col1 Col2 Col3 Type
46 16 17 18 3
45 13 14 15 3
20 7 8 9 2
0 1 2 3 1
1 4 5 6 1
21 10 11 12 2
If you want the index numbered 0, 1, ..., n-1 as in your example, you can simply reset the index: df_shuffled.reset_index(drop=True)
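A sketch of the positional approach using NumPy's newer Generator API instead of the legacy np.random functions. The frame contents and the seed are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the non-consecutive labels from the example.
df = pd.DataFrame(
    {"Col1": [1, 4, 7, 10, 13, 16], "Type": [1, 1, 2, 2, 3, 3]},
    index=[0, 1, 20, 21, 45, 46],
)

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability

# permutation(n) returns the integers 0..n-1 in random order, so every
# row position is used exactly once.
positions = rng.permutation(len(df))
df_shuffled = df.iloc[positions]

# choice works too, but only with replace=False; the default
# replace=True would draw some rows twice and skip others.
positions_alt = rng.choice(len(df), size=len(df), replace=False)
df_shuffled_alt = df.iloc[positions_alt]
```

Because iloc works on positions rather than labels, this also behaves well when the index contains gaps or duplicate labels.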
Answer 3
TL;DR: np.random.shuffle(ndarray) can do the job.
So, in your case:
np.random.shuffle(DataFrame.values)
Under the hood, DataFrame uses a NumPy ndarray as its data holder (you can check this in the DataFrame source code). So np.random.shuffle() shuffles that array along its first axis, i.e. row by row, while the index of the DataFrame remains unshuffled. This relies on .values returning a view of the frame's data, which holds only when all columns share a single dtype.
There are some points to consider, though:
- np.random.shuffle() returns None. If you want to keep a copy of the original object, you have to make it before you pass the object to the function.
- sklearn.utils.shuffle(), as user tj89 suggested, accepts random_state along with another option to control the output. You may want that for development purposes.
- sklearn.utils.shuffle() is faster, but it WILL SHUFFLE the axis info (index, column) of the DataFrame along with the ndarray it contains.
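One caveat worth checking before relying on np.random.shuffle(df.values): as far as I know, .values has to build a new array when the columns have different dtypes, so shuffling that array leaves the frame untouched. Newer pandas versions with copy-on-write may also hand back a read-only array for single-dtype frames. A small sketch of the mixed-dtype case, with made-up column names:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two different dtypes (int64 and float64).
df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [0.1, 0.2, 0.3, 0.4, 0.5]})
before = df.copy()

# With mixed dtypes, .values builds a fresh float64 array.
vals = df.values
np.random.default_rng(0).shuffle(vals)  # shuffles only that temporary array

# The frame itself was not modified by the shuffle.
print(df.equals(before))
```

If the goal is a shuffled frame rather than a shuffled array, df.sample(frac=1) or df.iloc with a permutation avoids this question entirely.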
Benchmark result between sklearn.utils.shuffle() and np.random.shuffle():
ndarray
nd = sklearn.utils.shuffle(nd)    0.10793248389381915 sec (8x faster)
np.random.shuffle(nd)             0.8897626010002568 sec
DataFrame
df = sklearn.utils.shuffle(df)    0.3183923360193148 sec (3x faster)
np.random.shuffle(df.values)      0.9357550159329548 sec
Conclusion: If it is okay for the axis info (index, column) to be shuffled along with the ndarray, use sklearn.utils.shuffle(). Otherwise, use np.random.shuffle().
Code used:
import timeit

# Shared setup: a 1000x100 float array and a DataFrame built from it.
setup = '''
import numpy as np
import pandas as pd
import sklearn.utils
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''
# Each statement is run 1000 times; timeit returns the total time in seconds.
timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000)
Answer 4
(I don't have enough reputation to comment on the top post, so I hope someone else can do that for me.) A question was raised about whether the first method:
df.sample(frac=1)
makes a deep copy or just changes the dataframe. I ran the following code:
print(hex(id(df)))
print(hex(id(df.sample(frac=1))))
print(hex(id(df.sample(frac=1).reset_index(drop=True))))
and my results were:
0x1f8a784d400
0x1f8b9d65e10
0x1f8b9d65b70
which means the method does not return the same object, contrary to what the follow-up note in the top answer suggested. So this method does indeed make a shuffled copy.
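id() only compares the Python objects, so it cannot say whether two frames share the same data buffer. A complementary check, sketched here under the assumption of a single-dtype frame with arbitrary contents, is to compare the underlying NumPy arrays with np.shares_memory:

```python
import numpy as np
import pandas as pd

# Single-dtype frame, sized arbitrarily for the demonstration.
df = pd.DataFrame(np.arange(3000, dtype="float64").reshape(1000, 3))

shuffled = df.sample(frac=1, random_state=0)

# "is" compares the Python wrapper objects; shares_memory looks at
# the NumPy buffers behind them.
same_object = shuffled is df
same_buffer = np.shares_memory(df.to_numpy(), shuffled.to_numpy())

print(same_object, same_buffer)
```

Both values coming out false is consistent with the conclusion above: the shuffled frame is a separate object holding its own copy of the data.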
Answer 5
Also useful: if you use this for machine learning and always want to split off the same data, you can use:
df.sample(n=len(df), random_state=42)
This makes sure that your random choice is always reproducible.
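A short sketch of what the fixed seed buys you, using a made-up frame: two calls with the same random_state produce the same row order, and n=len(df) behaves like frac=1.

```python
import pandas as pd

# Hypothetical frame; the column names are illustrative only.
df = pd.DataFrame({"x": range(10), "Type": [1] * 5 + [2] * 5})

# Same seed, same permutation, on every call and every run.
first = df.sample(n=len(df), random_state=42)
second = df.sample(n=len(df), random_state=42)

# frac=1 requests the same number of rows as n=len(df).
third = df.sample(frac=1, random_state=42)

print(first.equals(second), first.equals(third))
```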
Answer 6
AFAIK the simplest solution is:
df_shuffled = df.reindex(np.random.permutation(df.index))
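A sketch of the same idea on a frame with non-consecutive labels like the ones in the question (values and seed are made up). Note that reindex looks rows up by label, so as far as I know it needs unique index labels; with duplicates, a positional approach such as iloc is the safer choice.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with unique, non-consecutive index labels.
df = pd.DataFrame({"Col1": [1, 4, 7, 10]}, index=[0, 1, 20, 21])

rng = np.random.default_rng(0)  # arbitrary seed

# Permute the index labels, then let reindex fetch each row by its label.
shuffled_labels = rng.permutation(df.index.to_numpy())
df_shuffled = df.reindex(shuffled_labels)
```

Each row keeps its original label, so df_shuffled.loc[20] still returns the row that had label 20 before the shuffle.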
Answer 7
Shuffle the pandas data frame by taking an array of row numbers (in this case the index), randomizing its order, and setting that array as the index of the data frame. Then sort the data frame by index. The result is your shuffled dataframe:
import random
import pandas as pd

df = pd.DataFrame({"a":[1,2,3,4],"b":[5,6,7,8]})
# Row numbers 0..n-1, shuffled in place
index = list(range(len(df)))
random.shuffle(index)
# Assign the shuffled numbers as the new index, then sort by it
df.set_index([index]).sort_index()
Output (one possible result):
a b
0 2 6
1 1 5
2 3 7
3 4 8
Substitute your own data frame for mine in the code above.
Answer 8
Here is another way:
df['rnd'] = np.random.rand(len(df))
df = df.sort_values(by='rnd').drop('rnd', axis=1)
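The same random-key idea can be sketched without adding a temporary column, by sorting the keys with np.argsort and indexing by position. The frame and the seed below are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical frame for the demonstration.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

rng = np.random.default_rng(0)  # arbitrary seed

# One random key per row; argsort turns the keys into a row order.
keys = rng.random(len(df))
df_shuffled = df.iloc[np.argsort(keys)]
```

This leaves the original df untouched and never exposes the helper column to other code.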