Question: How to change a DataFrame column from String type to Double type in PySpark

I have a dataframe with a column of type String. I want to change the column type to Double in PySpark.

Here is what I did:

toDoublefunc = UserDefinedFunction(lambda x: x, DoubleType())
changedTypedf = joindf.withColumn("label", toDoublefunc(joindf['show']))

I just wanted to know whether this is the right way to do it. While running the result through Logistic Regression I am getting some errors, so I wonder whether this is the cause of the trouble.


Answer 0

There is no need for a UDF here. Column already provides a cast method that accepts a DataType instance:

from pyspark.sql.types import DoubleType

changedTypedf = joindf.withColumn("label", joindf["show"].cast(DoubleType()))

or a short string:

changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))

The canonical string names (other variations are supported as well) correspond to the simpleString values. So for the atomic types:

from pyspark.sql import types 

for t in ['BinaryType', 'BooleanType', 'ByteType', 'DateType', 
          'DecimalType', 'DoubleType', 'FloatType', 'IntegerType', 
           'LongType', 'ShortType', 'StringType', 'TimestampType']:
    print(f"{t}: {getattr(types, t)().simpleString()}")
BinaryType: binary
BooleanType: boolean
ByteType: tinyint
DateType: date
DecimalType: decimal(10,0)
DoubleType: double
FloatType: float
IntegerType: int
LongType: bigint
ShortType: smallint
StringType: string
TimestampType: timestamp

and, for example, for complex types:

types.ArrayType(types.IntegerType()).simpleString()   
'array<int>'
types.MapType(types.StringType(), types.IntegerType()).simpleString()
'map<string,int>'
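
One point worth spelling out: casting a string column to double never fails at runtime; values that cannot be parsed simply become null. A minimal sketch of how to see this, assuming a local SparkSession and a tiny hypothetical dataframe:

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([("1.5",), ("not a number",)], ["show"])

# un-parseable strings become null after the cast instead of raising an error
df.withColumn("label", df["show"].cast(DoubleType())).show()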

Answer 1

To preserve the name of the column and avoid adding an extra column, use the same name as the input column:

changedTypedf = joindf.withColumn("show", joindf["show"].cast(DoubleType()))

Answer 2

The given answers are enough to deal with the problem, but I want to share another way, which may have been introduced in a newer version of Spark (I am not sure about it), so the existing answers did not cover it.

We can reach the column in a Spark statement with the col("column_name") function:

from pyspark.sql.functions import col
changedTypedf = joindf.withColumn("show", col("show").cast("double"))

Answer 3

A PySpark version:

from pyspark.sql.types import IntegerType

df = <source data>
df.printSchema()

# Change column type
df_new = df.withColumn("myColumn", df["myColumn"].cast(IntegerType()))
df_new.printSchema()
df_new.select("myColumn").show()

Answer 4

The solution was simple:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType

toDoublefunc = UserDefinedFunction(lambda x: float(x), DoubleType())
changedTypedf = joindf.withColumn("label", toDoublefunc(joindf['show']))
