## 问题：如何将一列分为两列？

``````          row
0    00000 UNITED STATES
1    01000 ALABAMA
2    01001 Autauga County, AL
3    01003 Baldwin County, AL
4    01005 Barbour County, AL
``````

``````         fips       row
0    00000 UNITED STATES
1    01000 ALABAMA
2    01001 Autauga County, AL
3    01003 Baldwin County, AL
4    01005 Barbour County, AL
``````

I have a data frame with one (string) column and I’d like to split it into two (string) columns, with one column header as ‘`fips'` and the other `'row'`

My dataframe `df` looks like this:

``````          row
0    00000 UNITED STATES
1    01000 ALABAMA
2    01001 Autauga County, AL
3    01003 Baldwin County, AL
4    01005 Barbour County, AL
``````

I do not know how to use `df.row.str[:]` to achieve my goal of splitting the row cell. I can use `df['fips'] = hello` to add a new column and populate it with `hello`. Any ideas?

``````         fips       row
0    00000 UNITED STATES
1    01000 ALABAMA
2    01001 Autauga County, AL
3    01003 Baldwin County, AL
4    01005 Barbour County, AL
``````

## 回答 0

``````                            row
0       00000 UNITED STATES
1             01000 ALABAMA
2  01001 Autauga County, AL
3  01003 Baldwin County, AL
4  01005 Barbour County, AL
``````
``````df = pd.DataFrame(df.row.str.split(' ',1).tolist(),
columns = ['flips','row'])
``````
``````   flips                 row
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL
``````

There might be a better way, but this here’s one approach:

``````                            row
0       00000 UNITED STATES
1             01000 ALABAMA
2  01001 Autauga County, AL
3  01003 Baldwin County, AL
4  01005 Barbour County, AL
``````
``````df = pd.DataFrame(df.row.str.split(' ',1).tolist(),
columns = ['flips','row'])
``````
``````   flips                 row
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL
``````

# TL; DR版本：

• 我有一个带有定界符的文本列，我想要两列

``df['A'], df['B'] = df['AB'].str.split(' ', 1).str``

``df['AB'].str.split(' ', 1, expand=True)``

`expand=True`如果字符串的分割数不一致，并且要`None`替换缺失的值，则必须使用。

# 详细地：

``````>>> import pandas as pd
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})
>>> df

AB
0  A1-B1
1  A2-B2
>>> df['AB_split'] = df['AB'].str.split('-')
>>> df

AB  AB_split
0  A1-B1  [A1, B1]
1  A2-B2  [A2, B2]``````

1：如果不确定`.str.split()`do 的前两个参数是什么，我建议使用该方法纯Python版本的文档。

• 包含两个元素的列表的列

• 两列，每列包含列表的相应元素？

``````>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df

U
0  A
1  B
2  C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df

U  L
0  A  a
1  B  b
2  C  c``````

``````>>> df['AB'].str[0]

0    A
1    A
Name: AB, dtype: object

>>> df['AB'].str[1]

0    1
1    2
Name: AB, dtype: object``````

``````>>> df['AB'].str.split('-', 1).str[0]

0    A1
1    A2
Name: AB, dtype: object

>>> df['AB'].str.split('-', 1).str[1]

0    B1
1    B2
Name: AB, dtype: object``````

``````>>> df['A'], df['B'] = df['AB'].str.split('-', 1).str
>>> df

AB  AB_split   A   B
0  A1-B1  [A1, B1]  A1  B1
1  A2-B2  [A2, B2]  A2  B2``````

``````>>> df['AB'].str.split('-', 1, expand=True)

0   1
0  A1  B1
1  A2  B2``````

``````>>> df = df[['AB']]
>>> df

AB
0  A1-B1
1  A2-B2

>>> df.join(df['AB'].str.split('-', 1, expand=True).rename(columns={0:'A', 1:'B'}))

AB   A   B
0  A1-B1  A1  B1
1  A2-B2  A2  B2``````

`expand=True`版本虽然较长，但与元组拆包方法相比具有明显的优势。元组解压缩不能很好地处理不同长度的拆分：

``````>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2', 'A3-B3-C3']})
>>> df
AB
0     A1-B1
1     A2-B2
2  A3-B3-C3
>>> df['A'], df['B'], df['C'] = df['AB'].str.split('-')
Traceback (most recent call last):
[...]
ValueError: Length of values does not match length of index
>>> ``````

``````>>> df.join(
...     df['AB'].str.split('-', expand=True).rename(
...         columns={0:'A', 1:'B', 2:'C'}
...     )
... )
AB   A   B     C
0     A1-B1  A1  B1  None
1     A2-B2  A2  B2  None
2  A3-B3-C3  A3  B3    C3``````

# TL;DR version:

For the simple case of:

• I have a text column with a delimiter and I want two columns

The simplest solution is:

``````df[['A', 'B']] = df['AB'].str.split(' ', 1, expand=True)
``````

You must use `expand=True` if your strings have a non-uniform number of splits and you want `None` to replace the missing values.

Notice how, in either case, the `.tolist()` method is not necessary. Neither is `zip()`.

# In detail:

Andy Hayden’s solution is most excellent in demonstrating the power of the `str.extract()` method.

But for a simple split over a known separator (like, splitting by dashes, or splitting by whitespace), the `.str.split()` method is enough1. It operates on a column (Series) of strings, and returns a column (Series) of lists:

``````>>> import pandas as pd
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})
>>> df

AB
0  A1-B1
1  A2-B2
>>> df['AB_split'] = df['AB'].str.split('-')
>>> df

AB  AB_split
0  A1-B1  [A1, B1]
1  A2-B2  [A2, B2]
``````

1: If you’re unsure what the first two parameters of `.str.split()` do, I recommend the docs for the plain Python version of the method.

But how do you go from:

• a column containing two-element lists

to:

• two columns, each containing the respective element of the lists?

Well, we need to take a closer look at the `.str` attribute of a column.

It’s a magical object that is used to collect methods that treat each element in a column as a string, and then apply the respective method in each element as efficient as possible:

``````>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df

U
0  A
1  B
2  C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df

U  L
0  A  a
1  B  b
2  C  c
``````

But it also has an “indexing” interface for getting each element of a string by its index:

``````>>> df['AB'].str[0]

0    A
1    A
Name: AB, dtype: object

>>> df['AB'].str[1]

0    1
1    2
Name: AB, dtype: object
``````

Of course, this indexing interface of `.str` doesn’t really care if each element it’s indexing is actually a string, as long as it can be indexed, so:

``````>>> df['AB'].str.split('-', 1).str[0]

0    A1
1    A2
Name: AB, dtype: object

>>> df['AB'].str.split('-', 1).str[1]

0    B1
1    B2
Name: AB, dtype: object
``````

Then, it’s a simple matter of taking advantage of the Python tuple unpacking of iterables to do

``````>>> df['A'], df['B'] = df['AB'].str.split('-', 1).str
>>> df

AB  AB_split   A   B
0  A1-B1  [A1, B1]  A1  B1
1  A2-B2  [A2, B2]  A2  B2
``````

Of course, getting a DataFrame out of splitting a column of strings is so useful that the `.str.split()` method can do it for you with the `expand=True` parameter:

``````>>> df['AB'].str.split('-', 1, expand=True)

0   1
0  A1  B1
1  A2  B2
``````

So, another way of accomplishing what we wanted is to do:

``````>>> df = df[['AB']]
>>> df

AB
0  A1-B1
1  A2-B2

>>> df.join(df['AB'].str.split('-', 1, expand=True).rename(columns={0:'A', 1:'B'}))

AB   A   B
0  A1-B1  A1  B1
1  A2-B2  A2  B2
``````

The `expand=True` version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn’t deal well with splits of different lengths:

``````>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2', 'A3-B3-C3']})
>>> df
AB
0     A1-B1
1     A2-B2
2  A3-B3-C3
>>> df['A'], df['B'], df['C'] = df['AB'].str.split('-')
Traceback (most recent call last):
[...]
ValueError: Length of values does not match length of index
>>>
``````

But `expand=True` handles it nicely by placing `None` in the columns for which there aren’t enough “splits”:

``````>>> df.join(
...     df['AB'].str.split('-', expand=True).rename(
...         columns={0:'A', 1:'B', 2:'C'}
...     )
... )
AB   A   B     C
0     A1-B1  A1  B1  None
1     A2-B2  A2  B2  None
2  A3-B3-C3  A3  B3    C3
``````

## 回答 2

``````In [11]: df.row.str.extract('(?P<fips>\d{5})((?P<state>[A-Z ]*\$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}\$))')
Out[11]:
fips                    1           state           county state_code
0  00000        UNITED STATES   UNITED STATES              NaN        NaN
1  01000              ALABAMA         ALABAMA              NaN        NaN
2  01001   Autauga County, AL             NaN   Autauga County         AL
3  01003   Baldwin County, AL             NaN   Baldwin County         AL
4  01005   Barbour County, AL             NaN   Barbour County         AL

[5 rows x 5 columns]``````

``(?P<fips>\d{5})``
• 匹配五个数字（`\d`）并命名`"fips"`

``((?P<state>[A-Z ]*\$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}\$))``

`|`）做以下两件事之一：

``(?P<state>[A-Z ]*\$)``
• 匹配任意数量（`*`）的大写字母或空格（`[A-Z ]`），并`"state"`在字符串（`\$`）末尾之前对其进行命名，

``(?P<county>.*?), (?P<state_code>[A-Z]{2}\$))``
• `.*`然后匹配其他任何（）
• 然后用逗号和空格
• 匹配`state_code`字符串（`\$`）末尾的两位数字。

You can extract the different parts out quite neatly using a regex pattern:

``````In [11]: df.row.str.extract('(?P<fips>\d{5})((?P<state>[A-Z ]*\$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}\$))')
Out[11]:
fips                    1           state           county state_code
0  00000        UNITED STATES   UNITED STATES              NaN        NaN
1  01000              ALABAMA         ALABAMA              NaN        NaN
2  01001   Autauga County, AL             NaN   Autauga County         AL
3  01003   Baldwin County, AL             NaN   Baldwin County         AL
4  01005   Barbour County, AL             NaN   Barbour County         AL

[5 rows x 5 columns]
``````

To explain the somewhat long regex:

``````(?P<fips>\d{5})
``````
• Matches the five digits (`\d`) and names them `"fips"`.

The next part:

``````((?P<state>[A-Z ]*\$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}\$))
``````

Does either (`|`) one of two things:

``````(?P<state>[A-Z ]*\$)
``````
• Matches any number (`*`) of capital letters or spaces (`[A-Z ]`) and names this `"state"` before the end of the string (`\$`),

or

``````(?P<county>.*?), (?P<state_code>[A-Z]{2}\$))
``````
• matches anything else (`.*`) then
• a comma and a space then
• matches the two digit `state_code` before the end of the string (`\$`).

In the example:
Note that the first two rows hit the “state” (leaving NaN in the county and state_code columns), whilst the last three hit the county, state_code (leaving NaN in the state column).

## 回答 3

``df[['fips', 'row']] = df['row'].str.split(' ', n=1, expand=True)``
``````df[['fips', 'row']] = df['row'].str.split(' ', n=1, expand=True)
``````

## 回答 4

``````df["flips"], df["row_name"] = zip(*df["row"].str.split().tolist())
del df["row"]  ``````

If you don’t want to create a new dataframe, or if your dataframe has more columns than just the ones you want to split, you could:

``````df["flips"], df["row_name"] = zip(*df["row"].str.split().tolist())
del df["row"]
``````

## 回答 5

``````df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA',
'01001 Autauga County, AL', '01003 Baldwin County, AL',
'01005 Barbour County, AL']})
print (df)
row
0       00000 UNITED STATES
1             01000 ALABAMA
2  01001 Autauga County, AL
3  01003 Baldwin County, AL
4  01005 Barbour County, AL

df[['a','b']] = df['row'].str.split(n=1, expand=True)
print (df)
row      a                   b
0       00000 UNITED STATES  00000       UNITED STATES
1             01000 ALABAMA  01000             ALABAMA
2  01001 Autauga County, AL  01001  Autauga County, AL
3  01003 Baldwin County, AL  01003  Baldwin County, AL
4  01005 Barbour County, AL  01005  Barbour County, AL``````

``````df[['a','b']] = df.pop('row').str.split(n=1, expand=True)
print (df)
a                   b
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL``````

``````df[['a','b']] = df['row'].str.split(n=1, expand=True)
df = df.drop('row', axis=1)
print (df)

a                   b
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL``````

``````#remove n=1 for split by all whitespaces
df[['a','b']] = df['row'].str.split(expand=True)``````

ValueError：列的长度必须与键的长度相同

``````print (df['row'].str.split(expand=True))
0        1        2     3
0  00000   UNITED   STATES  None
1  01000  ALABAMA     None  None
2  01001  Autauga  County,    AL
3  01003  Baldwin  County,    AL
4  01005  Barbour  County,    AL``````

``````df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA',
'01001 Autauga County, AL', '01003 Baldwin County, AL',
'01005 Barbour County, AL'],
'a':range(5)})
print (df)
a                       row
0  0       00000 UNITED STATES
1  1             01000 ALABAMA
2  2  01001 Autauga County, AL
3  3  01003 Baldwin County, AL
4  4  01005 Barbour County, AL

df = df.join(df['row'].str.split(expand=True))
print (df)

a                       row      0        1        2     3
0  0       00000 UNITED STATES  00000   UNITED   STATES  None
1  1             01000 ALABAMA  01000  ALABAMA     None  None
2  2  01001 Autauga County, AL  01001  Autauga  County,    AL
3  3  01003 Baldwin County, AL  01003  Baldwin  County,    AL
4  4  01005 Barbour County, AL  01005  Barbour  County,    AL``````

``````df = df.join(df.pop('row').str.split(expand=True))
print (df)
a      0        1        2     3
0  0  00000   UNITED   STATES  None
1  1  01000  ALABAMA     None  None
2  2  01001  Autauga  County,    AL
3  3  01003  Baldwin  County,    AL
4  4  01005  Barbour  County,    AL   ``````

You can use by whitespace (default separator) and parameter `expand=True` for `DataFrame` with assign to new columns:

``````df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA',
'01001 Autauga County, AL', '01003 Baldwin County, AL',
'01005 Barbour County, AL']})
print (df)
row
0       00000 UNITED STATES
1             01000 ALABAMA
2  01001 Autauga County, AL
3  01003 Baldwin County, AL
4  01005 Barbour County, AL

df[['a','b']] = df['row'].str.split(n=1, expand=True)
print (df)
row      a                   b
0       00000 UNITED STATES  00000       UNITED STATES
1             01000 ALABAMA  01000             ALABAMA
2  01001 Autauga County, AL  01001  Autauga County, AL
3  01003 Baldwin County, AL  01003  Baldwin County, AL
4  01005 Barbour County, AL  01005  Barbour County, AL
``````

Modification if need remove original column with `DataFrame.pop`

``````df[['a','b']] = df.pop('row').str.split(n=1, expand=True)
print (df)
a                   b
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL
``````

What is same like:

``````df[['a','b']] = df['row'].str.split(n=1, expand=True)
df = df.drop('row', axis=1)
print (df)

a                   b
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL
``````

If get error:

``````#remove n=1 for split by all whitespaces
df[['a','b']] = df['row'].str.split(expand=True)
``````

ValueError: Columns must be same length as key

You can check and it return 4 column `DataFrame`, not only 2:

``````print (df['row'].str.split(expand=True))
0        1        2     3
0  00000   UNITED   STATES  None
1  01000  ALABAMA     None  None
2  01001  Autauga  County,    AL
3  01003  Baldwin  County,    AL
4  01005  Barbour  County,    AL
``````

Then solution is append new `DataFrame` by `join`:

``````df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA',
'01001 Autauga County, AL', '01003 Baldwin County, AL',
'01005 Barbour County, AL'],
'a':range(5)})
print (df)
a                       row
0  0       00000 UNITED STATES
1  1             01000 ALABAMA
2  2  01001 Autauga County, AL
3  3  01003 Baldwin County, AL
4  4  01005 Barbour County, AL

df = df.join(df['row'].str.split(expand=True))
print (df)

a                       row      0        1        2     3
0  0       00000 UNITED STATES  00000   UNITED   STATES  None
1  1             01000 ALABAMA  01000  ALABAMA     None  None
2  2  01001 Autauga County, AL  01001  Autauga  County,    AL
3  3  01003 Baldwin County, AL  01003  Baldwin  County,    AL
4  4  01005 Barbour County, AL  01005  Barbour  County,    AL
``````

With remove original column (if there are also another columns):

``````df = df.join(df.pop('row').str.split(expand=True))
print (df)
a      0        1        2     3
0  0  00000   UNITED   STATES  None
1  1  01000  ALABAMA     None  None
2  2  01001  Autauga  County,    AL
3  3  01003  Baldwin  County,    AL
4  4  01005  Barbour  County,    AL
``````

## 回答 6

``df['column_name'].str.split('/', expand=True)``

If you want to split a string into more than two columns based on a delimiter you can omit the ‘maximum splits’ parameter.
You can use:

``````df['column_name'].str.split('/', expand=True)
``````

This will automatically create as many columns as the maximum number of fields included in any of your initial strings.

# `Series.str.partition`

`partition` 在分隔符上执行一次拆分，通常表现出色。

``````df['row'].str.partition(' ')[[0, 2]]

0                   2
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL``````

``````df['row'].str.partition(' ')[[0, 2]].rename({0: 'fips', 2: 'row'}, axis=1)

fips                 row
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL``````

``df.join(df['row'].str.partition(' ')[[0, 2]])``

``````pd.concat([df, df['row'].str.partition(' ')[[0, 2]]], axis=1)

row      0                   2
0       00000 UNITED STATES  00000       UNITED STATES
1             01000 ALABAMA  01000             ALABAMA
2  01001 Autauga County, AL  01001  Autauga County, AL
3  01003 Baldwin County, AL  01003  Baldwin County, AL
4  01005 Barbour County, AL  01005  Barbour County, AL``````

Surprised I haven’t seen this one yet. If you only need two splits, I highly recommend. . .

# `Series.str.partition`

`partition` performs one split on the separator, and is generally quite performant.

``````df['row'].str.partition(' ')[[0, 2]]

0                   2
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL
``````

If you need to rename the rows,

``````df['row'].str.partition(' ')[[0, 2]].rename({0: 'fips', 2: 'row'}, axis=1)

fips                 row
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL
``````

If you need to join this back to the original, use `join` or `concat`:

``````df.join(df['row'].str.partition(' ')[[0, 2]])
``````

``````pd.concat([df, df['row'].str.partition(' ')[[0, 2]]], axis=1)

row      0                   2
0       00000 UNITED STATES  00000       UNITED STATES
1             01000 ALABAMA  01000             ALABAMA
2  01001 Autauga County, AL  01001  Autauga County, AL
3  01003 Baldwin County, AL  01003  Baldwin County, AL
4  01005 Barbour County, AL  01005  Barbour County, AL
``````

## 回答 8

`````` col1 = df["<col_name>"].apply(<function>)
col2 = ...
df = df.join(col1.to_frame(name="<name1>"))
df = df.join(col2.toframe(name="<name2>"))
df = df.drop(["<col_name>"], axis=1)``````

``````lambda x: x.split(" ")[0] # for the first element
lambda x: x.split(" ")[-1] # for the last element``````

I prefer exporting the corresponding pandas series (i.e. the columns I need), using the apply function to split the column content into multiple series and then join the generated columns to the existing DataFrame. Of course, the source column should be removed.

e.g.

`````` col1 = df["<col_name>"].apply(<function>)
col2 = ...
df = df.join(col1.to_frame(name="<name1>"))
df = df.join(col2.toframe(name="<name2>"))
df = df.drop(["<col_name>"], axis=1)
``````

To split two words strings function should be something like that:

``````lambda x: x.split(" ")[0] # for the first element
lambda x: x.split(" ")[-1] # for the last element
``````

## 回答 9

``````df["<col_name>"].str.slice(stop=5)
df["<col_name>"].str.slice(start=6)``````

I saw that no one had used the slice method, so here I put my 2 cents here.

``````df["<col_name>"].str.slice(stop=5)
df["<col_name>"].str.slice(start=6)
``````

This method will create two new columns.

## 回答 10

``````split = df_selected['name'].str.split(',', 1, expand=True)
df_split = df_selected.assign(first_name=split[0], last_name=split[1])
df_split.drop('name', 1, inplace=True)``````

Use `df.assign` to create a new df. See http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

``````split = df_selected['name'].str.split(',', 1, expand=True)
df_split = df_selected.assign(first_name=split[0], last_name=split[1])
df_split.drop('name', 1, inplace=True)
``````