Random state (pseudo-random number) in scikit-learn

Question: Random state (pseudo-random number) in scikit-learn

I want to implement a machine learning algorithm in scikit-learn, but I don’t understand what the random_state parameter does. Why should I use it?

I also don’t understand what a pseudo-random number is.


Answer 0

train_test_split splits arrays or matrices into random train and test subsets. That means that every time you run it without specifying random_state, you will get a different result. This is expected behavior. For example:

Run 1:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> a, b = np.arange(10).reshape((5, 2)), range(5)
>>> train_test_split(a, b)
[array([[6, 7],
        [8, 9],
        [4, 5]]),
 array([[2, 3],
        [0, 1]]), [3, 4, 2], [1, 0]]

Run 2:

>>> train_test_split(a, b)
[array([[8, 9],
        [4, 5],
        [0, 1]]),
 array([[6, 7],
        [2, 3]]), [4, 2, 0], [3, 1]]

The result changes. On the other hand, if you use random_state=some_number, you can guarantee that the output of Run 1 will equal the output of Run 2, i.e. your split will always be the same. It doesn’t matter whether the actual random_state number is 42, 0, 21, … What matters is that every time you use 42, you will get the same output the first time you make the split. This is useful if you want reproducible results, for example in documentation, so that everybody sees the same numbers when they run the examples. In practice, I would set random_state to some fixed number while you test things, and remove it in production if you really need a random (and not a fixed) split.
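
As a minimal sketch of the point about fixed seeds, the snippet below calls train_test_split twice with the same random_state and checks that both calls return the same pieces. The seed 42 and the variable names first, second and same_split are illustrative choices, not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import train_test_split

a, b = np.arange(10).reshape((5, 2)), list(range(5))

# Two independent calls with the same seed (42 is an arbitrary choice).
first = train_test_split(a, b, random_state=42)
second = train_test_split(a, b, random_state=42)

# Each call returns [a_train, a_test, b_train, b_test]; compare piece by piece.
same_split = all(np.array_equal(x, y) for x, y in zip(first, second))
print(same_split)  # True
```

Dropping random_state from either call removes this guarantee.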

Regarding your second question, a pseudo-random number generator is a deterministic algorithm that produces a sequence of numbers that look random; the same seed always yields the same sequence, which is why fixing random_state fixes the split. Why they are not truly random is beyond the scope of this question and probably won’t matter in your case; you can take a look here for more details.
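
To make the idea of a pseudo-random generator a little more concrete, here is a small sketch using Python's standard random module rather than scikit-learn. It only illustrates that a generator is fully determined by its seed; the seeds 42 and 43 and the names gen_a, gen_b and gen_c are assumptions made for the example.

```python
import random

# Two generators seeded with the same value (42 is arbitrary).
gen_a = random.Random(42)
gen_b = random.Random(42)

seq_a = [gen_a.randint(0, 99) for _ in range(5)]
seq_b = [gen_b.randint(0, 99) for _ in range(5)]

# A generator with a different seed starts from a different internal state.
gen_c = random.Random(43)
seq_c = [gen_c.randint(0, 99) for _ in range(5)]

print(seq_a == seq_b)  # True: same seed, same sequence
print(seq_a, seq_c)
```

The numbers look random, but nothing about them is left to chance once the seed is chosen.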


Answer 1

If you don’t specify random_state in your code, then every time you run (execute) your code a new random value is generated, and the train and test datasets will contain different values each time.

However, if you assign a fixed value such as random_state = 42, then no matter how many times you execute your code the result will be the same, i.e. the same values in the train and test datasets.


Answer 2

If you don’t set random_state in the code, then whenever you execute it a new random value is generated, and the train and test datasets will contain different values each time.

However, if you use a particular value for random_state (random_state = 1 or any other value), the result will be the same every time, i.e. the same values in the train and test datasets. Refer to the code below:

import pandas as pd
from sklearn.model_selection import train_test_split

test_series = pd.Series(range(100))
# Same seed, different test sizes.
size30split = train_test_split(test_series, random_state=1, test_size=0.3)
size25split = train_test_split(test_series, random_state=1, test_size=0.25)
# Count the values shared by the two training parts (index 0 is the training part).
# Sets compare values; `in` on a Series would test index labels instead.
common = set(size25split[0]) & set(size30split[0])
print(len(common))

No matter how many times you run the code, the output will be 70.

70

Now remove random_state and run the code again.

import pandas as pd
from sklearn.model_selection import train_test_split

test_series = pd.Series(range(100))
# No seed: each call draws a fresh random split.
size30split = train_test_split(test_series, test_size=0.3)
size25split = train_test_split(test_series, test_size=0.25)
common = set(size25split[0]) & set(size30split[0])
print(len(common))

This time the output will generally differ each time you execute the code.


Answer 3

The random_state number controls how the data is randomly split into training and test sets. In addition to what is explained here, it is important to remember that the random_state value can have a significant effect on the measured quality of your model (by quality I essentially mean prediction accuracy). For instance, if you take a certain dataset and train a regression model on it without specifying random_state, you may get a different accuracy for your trained model on the test data every time. So it is important to find the best random_state value to give you the most accurate model. That number can then be used to reproduce your model on another occasion, such as another research experiment. To do so, you can split and train the model in a for-loop, assigning a different number to the random_state parameter on each iteration:

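import numpy as np
from sklearn.linear_model import LarsCV
from sklearn.model_selection import train_test_split

# X and y are assumed to be an existing feature matrix and target vector.
tr_score, ts_score = [], []
for j in range(1000):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=j, test_size=0.35)
    lr = LarsCV().fit(X_train, y_train)
    tr_score.append(lr.score(X_train, y_train))
    ts_score.append(lr.score(X_test, y_test))

# Seed whose split gave the highest test score.
J = int(np.argmax(ts_score))

# Refit on that split so the model can be reproduced later.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=J, test_size=0.35)
M = LarsCV().fit(X_train, y_train)
y_pred = M.predict(X_test)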


Answer 4

If no random_state is provided, the system will use a random state that is generated internally. So when you run the program multiple times you may see different train/test data points, and the behavior will be unpredictable. If you have an issue with your model, you will not be able to recreate it, because you do not know the random number that was generated when you ran the program.

Consider the tree classifiers, either DT or RF: they try to build a tree using an optimal plan. Although this plan will be the same most of the time, there can be instances where the tree differs, and so do the predictions. When you try to debug your model, you may not be able to recreate the same instance for which a tree was built. To avoid all this hassle, we set random_state while building a DecisionTreeClassifier or RandomForestClassifier.
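
The following is a rough sketch of that idea for a random forest, not a recipe from the original answer. It fits two RandomForestClassifier models with the same random_state on a synthetic dataset and checks that they produce identical predicted probabilities. The dataset from make_classification, the seed 7, and the names forest_a and forest_b are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only so the example is self-contained.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Same seed for both forests: bootstrap samples and feature choices repeat.
forest_a = RandomForestClassifier(n_estimators=20, random_state=7).fit(X, y)
forest_b = RandomForestClassifier(n_estimators=20, random_state=7).fit(X, y)

same_predictions = np.array_equal(
    forest_a.predict_proba(X), forest_b.predict_proba(X)
)
print(same_predictions)  # True
```

With the seed fixed, a model that misbehaves can be rebuilt exactly and inspected.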

PS: You can look a bit more deeply into how the tree is built in DecisionTree to understand this better.

random_state is basically used to reproduce your problem identically every time it is run. If you do not use random_state in train_test_split, every time you make the split you may get a different set of train and test data points, which will not help you debug if you run into an issue.

From the docs:

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
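
The three accepted forms from the quoted documentation can be tried side by side. This is only a sketch: the array data and the variable names are made up for the example, and seeding NumPy's global generator with np.random.seed is shown as one possible way to pin the None case, not as a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(100)

# int: a fresh generator is seeded on every call, so repeated calls agree.
train_int, _ = train_test_split(data, random_state=0)

# RandomState instance: the generator itself is used, and its state
# advances from one call to the next.
rng = np.random.RandomState(0)
train_rng_first, _ = train_test_split(data, random_state=rng)
train_rng_second, _ = train_test_split(data, random_state=rng)

# None (the default): NumPy's global generator is used.
np.random.seed(0)
train_none, _ = train_test_split(data)

first_matches_int = np.array_equal(train_int, train_rng_first)
print(first_matches_int)  # True
print(np.array_equal(train_rng_first, train_rng_second))
print(np.array_equal(train_int, train_none))
```

Passing an int is usually the simplest choice when the goal is that every call gives the same split.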


Answer 5

sklearn.model_selection.train_test_split(*arrays, **options)

Split arrays or matrices into random train and test subsets

Parameters: ... 
    random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Source: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

"Regarding the random state, it is used in many randomized algorithms in sklearn to determine the random seed passed to the pseudo-random number generator. Therefore, it does not govern any aspect of the algorithm’s behavior. As a consequence, random state values which performed well in the validation set do not correspond to those which would perform well in a new, unseen test set. Indeed, depending on the algorithm, you might see completely different results by just changing the ordering of training samples." Source: https://stats.stackexchange.com/questions/263999/is-random-state-a-parameter-to-tune
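
One way to see the point of the quoted passage is to look at how much a test score moves when only the split seed changes. The sketch below is an illustration under assumptions: the data comes from make_regression, the model is a plain LinearRegression, and 20 seeds are tried. None of these choices come from the quoted sources.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem, used only to keep the example self-contained.
X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

scores = []
for seed in range(20):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.35, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))  # R^2 on the held-out part

# The spread reflects which rows landed in the test set, not a better model.
print(f"mean R^2 = {np.mean(scores):.3f}, std = {np.std(scores):.3f}")
print(f"best single split = {np.max(scores):.3f}")
```

Reporting the mean and spread over several seeds gives a fairer picture than quoting the single best split.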