问题:从字符串中删除所有特殊字符,标点和空格

我需要从字符串中删除所有特殊字符,标点符号和空格,以便只有字母和数字。

I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.


回答 0

这可以不用正则表达式来完成:

>>> string = "Special $#! characters   spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'

您可以使用str.isalnum

S.isalnum() -> bool

Return True if all characters in S are alphanumeric
and there is at least one character in S, False otherwise.

如果您坚持使用正则表达式,则其他解决方案也可以。但是请注意,如果可以在不使用正则表达式的情况下完成此操作,那么这是最好的解决方法。

This can be done without regex:

>>> string = "Special $#! characters   spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'

You can use str.isalnum:

S.isalnum() -> bool

Return True if all characters in S are alphanumeric
and there is at least one character in S, False otherwise.

If you insist on using regex, other solutions will do fine. However note that if it can be done without using a regular expression, that’s the best way to go about it.


回答 1

这是一个正则表达式,用于匹配不是字母或数字的字符串:

[^A-Za-z0-9]+

这是执行正则表达式替换的Python命令:

re.sub('[^A-Za-z0-9]+', '', mystring)

Here is a regex to match a string of characters that are not a letters or numbers:

[^A-Za-z0-9]+

Here is the Python command to do a regex substitution:

re.sub('[^A-Za-z0-9]+', '', mystring)

回答 2

较短的方法:

import re
cleanString = re.sub('\W+','', string )

如果要在单词和数字之间留空格,请用”代替’

Shorter way :

import re
cleanString = re.sub('\W+','', string )

If you want spaces between words and numbers substitute ” with ‘ ‘


回答 3

看到这一点之后,我有兴趣通过找出执行时间最短的方法来扩展所提供的答案,因此我仔细检查了一些建议的答案,并timeit对照了两个示例字符串:

  • string1 = 'Special $#! characters spaces 888323'
  • string2 = 'how much for the maple syrup? $20.99? That s ricidulous!!!'

例子1

'.join(e for e in string if e.isalnum())

  • string1 -结果:10.7061979771
  • string2 -结果:7.77832597694

例子2

import re re.sub('[^A-Za-z0-9]+', '', string)

  • string1 -结果:7.10785102844
  • string2 -结果:4.12814903259

例子3

import re re.sub('\W+','', string)

  • string1 -结果:3.11899876595
  • string2 -结果:2.78014397621

以上结果是以下平均值的最低返回结果的乘积: repeat(3, 2000000)

示例3的速度可以比示例13倍。

After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit against two of the example strings:

  • string1 = 'Special $#! characters spaces 888323'
  • string2 = 'how much for the maple syrup? $20.99? That s ricidulous!!!'

Example 1

'.join(e for e in string if e.isalnum())

  • string1 – Result: 10.7061979771
  • string2 – Result: 7.78372597694

Example 2

import re re.sub('[^A-Za-z0-9]+', '', string)

  • string1 – Result: 7.10785102844
  • string2 – Result: 4.12814903259

Example 3

import re re.sub('\W+','', string)

  • string1 – Result: 3.11899876595
  • string2 – Result: 2.78014397621

The above results are a product of the lowest returned result from an average of: repeat(3, 2000000)

Example 3 can be 3x faster than Example 1.


回答 4

Python 2. *

我认为filter(str.isalnum, string)效果很好

In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'

Python 3. *

在Python3中,filter( )函数将返回一个可迭代的对象(而不是上面的字符串)。必须重新加入以从itertable中获取字符串:

''.join(filter(str.isalnum, string)) 

或通过list加入使用(不确定,但可以很快

''.join([*filter(str.isalnum, string)])

注意:[*args]Python> = 3.5中解压缩有效

Python 2.*

I think just filter(str.isalnum, string) works

In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'

Python 3.*

In Python3, filter( ) function would return an itertable object (instead of string unlike in above). One has to join back to get a string from itertable:

''.join(filter(str.isalnum, string)) 

or to pass list in join use (not sure but can be fast a bit)

''.join([*filter(str.isalnum, string)])

note: unpacking in [*args] valid from Python >= 3.5


回答 5

#!/usr/bin/python
import re

strs = "how much for the maple syrup? $20.99? That's ricidulous!!!"
print strs
nstr = re.sub(r'[?|$|.|!]',r'',strs)
print nstr
nestr = re.sub(r'[^a-zA-Z0-9 ]',r'',nstr)
print nestr

您可以添加更多特殊字符,然后将其替换为“”,则表示没有任何意义,即它们将被删除。

#!/usr/bin/python
import re

strs = "how much for the maple syrup? $20.99? That's ricidulous!!!"
print strs
nstr = re.sub(r'[?|$|.|!]',r'',strs)
print nstr
nestr = re.sub(r'[^a-zA-Z0-9 ]',r'',nstr)
print nestr

you can add more special character and that will be replaced by ” means nothing i.e they will be removed.


回答 6

与使用正则表达式的其他所有人不同,我将尝试排除想要的每个字符,而不是明确枚举不需要的字符。

例如,如果我只需要’a到z’字符(大写和小写)和数字,我将排除所有其他内容:

import re
s = re.sub(r"[^a-zA-Z0-9]","",s)

这意味着“用空字符串替换每个不是数字的字符,或者用’a到z’或’A到Z’范围内的字符代替”。

实际上,如果^在正则表达式的第一位插入特殊字符,则会得到否定。

额外提示:如果您还需要将结果小写,则可以使正则表达式更快,更轻松,只要您现在找不到大写即可。

import re
s = re.sub(r"[^a-z0-9]","",s.lower())

Differently than everyone else did using regex, I would try to exclude every character that is not what I want, instead of enumerating explicitly what I don’t want.

For example, if I want only characters from ‘a to z’ (upper and lower case) and numbers, I would exclude everything else:

import re
s = re.sub(r"[^a-zA-Z0-9]","",s)

This means “substitute every character that is not a number, or a character in the range ‘a to z’ or ‘A to Z’ with an empty string”.

In fact, if you insert the special character ^ at the first place of your regex, you will get the negation.

Extra tip: if you also need to lowercase the result, you can make the regex even faster and easier, as long as you won’t find any uppercase now.

import re
s = re.sub(r"[^a-z0-9]","",s.lower())

回答 7

假设您要使用正则表达式,并且想要/需要支持2to3的Unicode识别2.x代码:

>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>

Assuming you want to use a regex and you want/need Unicode-cognisant 2.x code that is 2to3-ready:

>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>

回答 8

s = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s)
s = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s)

回答 9

最通用的方法是使用unicodedata表的“类别”,该表对每个单个字符进行分类。例如,以下代码根据其类别仅过滤可打印字符:

import unicodedata
# strip of crap characters (based on the Unicode database
# categorization:
# http://www.sql-und-xml.de/unicode-database/#kategorien

PRINTABLE = set(('Lu', 'Ll', 'Nd', 'Zs'))

def filter_non_printable(s):
    result = []
    ws_last = False
    for c in s:
        c = unicodedata.category(c) in PRINTABLE and c or u'#'
        result.append(c)
    return u''.join(result).replace(u'#', u' ')

查看上面所有相关类别的给定URL。当然,您也可以按标点符号类别进行过滤。

The most generic approach is using the ‘categories’ of the unicodedata table which classifies every single character. E.g. the following code filters only printable characters based on their category:

import unicodedata
# strip of crap characters (based on the Unicode database
# categorization:
# http://www.sql-und-xml.de/unicode-database/#kategorien

PRINTABLE = set(('Lu', 'Ll', 'Nd', 'Zs'))

def filter_non_printable(s):
    result = []
    ws_last = False
    for c in s:
        c = unicodedata.category(c) in PRINTABLE and c or u'#'
        result.append(c)
    return u''.join(result).replace(u'#', u' ')

Look at the given URL above for all related categories. You also can of course filter by the punctuation categories.


回答 10

string。标点符号包含以下字符:

‘!“#$%&\’()* +,-。/ :; <=>?@ [\] ^ _`{|}〜’

您可以使用translate和maketrans函数将标点符号映射到空值(替换)

import string

'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))

输出:

'This is A test'

string.punctuation contains following characters:

‘!”#$%&\'()*+,-./:;<=>?@[\]^_`{|}~’

You can use translate and maketrans functions to map punctuations to empty values (replace)

import string

'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))

Output:

'This is A test'

回答 11

使用翻译:

import string

def clean(instr):
    return instr.translate(None, string.punctuation + ' ')

警告:仅适用于ASCII字符串。

Use translate:

import string

def clean(instr):
    return instr.translate(None, string.punctuation + ' ')

Caveat: Only works on ascii strings.


回答 12

import re
my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the 

与双引号相同。“”“

# if we need to count the word python that ends with or without ',' or '.' at end

count = 0
for i in text:
    if i.endswith("."):
        text[count] = re.sub("^([a-z]+)(.)?$", r"\1", i)
    count += 1
print("The count of Python : ", text.count("python"))
import re
my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the 

same as double quotes.”””

# if we need to count the word python that ends with or without ',' or '.' at end

count = 0
for i in text:
    if i.endswith("."):
        text[count] = re.sub("^([a-z]+)(.)?$", r"\1", i)
    count += 1
print("The count of Python : ", text.count("python"))

回答 13

import re
abc = "askhnl#$%askdjalsdk"
ddd = abc.replace("#$%","")
print (ddd)

您将看到的结果为

‘askhnlaskdjalsdk

import re
abc = "askhnl#$%askdjalsdk"
ddd = abc.replace("#$%","")
print (ddd)

and you shall see your result as

‘askhnlaskdjalsdk


回答 14

删除标点,数字和特殊字符

例子:-

在此处输入图片说明

combi['tidy_tweet'] = combi['tidy_tweet'].str.replace("[^a-zA-Z#]", " ") 

结果:- 在此处输入图片说明

谢谢 :)

Removing Punctuations, Numbers, and Special Characters

Example :-

enter image description here

Code

combi['tidy_tweet'] = combi['tidy_tweet'].str.replace("[^a-zA-Z#]", " ") 

Result:- enter image description here

Thanks :)


声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。