问题:如何使用Python删除非ASCII字符但保留句点和空格?
我正在使用.txt文件。我希望文件中的文本字符串不包含非ASCII字符。但是,我想留空格和句点。目前,我也正在剥离它们。这是代码:
def onlyascii(char):
if ord(char) < 48 or ord(char) > 127: return ''
else: return char
def get_my_string(file_path):
f=open(file_path,'r')
data=f.read()
f.close()
filtered_data=filter(onlyascii, data)
filtered_data = filtered_data.lower()
return filtered_data
我应该如何修改onlyascii()以保留空格和句点?我想这并不太复杂,但我无法弄清楚。
I’m working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I’m stripping those too. Here’s the code:
def onlyascii(char):
if ord(char) < 48 or ord(char) > 127: return ''
else: return char
def get_my_string(file_path):
f=open(file_path,'r')
data=f.read()
f.close()
filtered_data=filter(onlyascii, data)
filtered_data = filtered_data.lower()
return filtered_data
How should I modify onlyascii() to leave spaces and periods? I imagine it’s not too complicated but I can’t figure it out.
回答 0
您可以使用string.printable过滤字符串中所有不可打印的字符,如下所示:
>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'
我机器上的string.printable包含:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c
编辑:在Python 3上,筛选器将返回可迭代。返回字符串的正确方法是:
''.join(filter(lambda x: x in printable, s))
You can filter all characters from the string that are not printable using string.printable, like this:
>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'
string.printable on my machine contains:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c
EDIT: On Python 3, filter will return an iterable. The correct way to obtain a string back would be:
''.join(filter(lambda x: x in printable, s))
回答 1
更改为其他编解码器的简单方法是使用encode()或decode()。在您的情况下,您想转换为ASCII并忽略所有不支持的符号。例如,瑞典字母å不是ASCII字符:
>>>s = u'Good bye in Swedish is Hej d\xe5'
>>>s = s.encode('ascii',errors='ignore')
>>>print s
Good bye in Swedish is Hej d
编辑:
Python3:str->字节-> str
>>>"Hej då".encode("ascii", errors="ignore").decode()
'hej d'
Python2:unicode-> str-> unicode
>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'
Python2:str-> unicode-> str(以相反的顺序解码和编码)
>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
An easy way to change to a different codec, is by using encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:
>>>s = u'Good bye in Swedish is Hej d\xe5'
>>>s = s.encode('ascii',errors='ignore')
>>>print s
Good bye in Swedish is Hej d
Edit:
Python3: str -> bytes -> str
>>>"Hej då".encode("ascii", errors="ignore").decode()
'hej d'
Python2: unicode -> str -> unicode
>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'
Python2: str -> unicode -> str (decode and encode in reverse order)
>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
回答 2
回答 3
您的问题不明确;前两个句子加在一起表示您认为空格和“句点”是非ASCII字符。这是不正确的。等于ord(char)<= 127的所有字符都是ASCII字符。例如,您的函数不包括这些字符!“#$%&\’()* +,-。/,但包括其他几个字符,例如[] {}。
请退后一步,三思而后行,然后编辑您的问题以告诉我们您要做什么,而无需提及ASCII单词,以及为什么您认为ord(char)> = 128这样的chars是可忽略的。另外:哪个版本的Python?输入数据的编码是什么?
请注意,您的代码将整个输入文件读取为单个字符串,并且您对另一个答案的注释(“最佳解决方案”)意味着您无需关心数据中的换行符。如果您的文件包含这样的两行:
this is line 1
this is line 2
结果将是'this is line 1this is line 2'
……您真正想要的是什么?
更好的解决方案包括:
- 过滤器功能比一个更好的名字
onlyascii
认识到如果要保留参数,则过滤器功能仅需要返回真实值:
def filter_func(char):
return char == '\n' or 32 <= ord(char) <= 126
# and later:
filtered_data = filter(filter_func, data).lower()
Your question is ambiguous; the first two sentences taken together imply that you believe that space and “period” are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !”#$%&\'()*+,-./ but includes several others e.g. []{}.
Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?
Please note that your code reads the whole input file as a single string, and your comment (“great solution”) to another answer implies that you don’t care about newlines in your data. If your file contains two lines like this:
this is line 1
this is line 2
the result would be 'this is line 1this is line 2'
… is that what you really want?
A greater solution would include:
- a better name for the filter function than
onlyascii
recognition that a filter function merely needs to return a truthy value if the argument is to be retained:
def filter_func(char):
return char == '\n' or 32 <= ord(char) <= 126
# and later:
filtered_data = filter(filter_func, data).lower()
回答 4
您可以使用以下代码删除非英语字母:
import re
str = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]',r'', str)
print(result)
这将返回
123456790 ABC#%?。()
You may use the following code to remove non-English letters:
import re
str = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]',r'', str)
print(result)
This will return
123456790 ABC#%? .()
回答 5
如果您需要可打印的ASCII字符,则可能应将代码更正为:
if ord(char) < 32 or ord(char) > 126: return ''
等同于string.printable
(@jterrace的答案),除了没有返回和制表符(’\ t’,’\ n’,’\ x0b’,’\ x0c’和’\ r’),但不对应您问题的范围
If you want printable ascii characters you probably should correct your code to:
if ord(char) < 32 or ord(char) > 126: return ''
this is equivalent, to string.printable
(answer from @jterrace), except for the absence of returns and tabs (‘\t’,’\n’,’\x0b’,’\x0c’ and ‘\r’) but doesnt correspond to the range on your question
回答 6
我强烈推荐使用Fluent Python(Ramalho)。列出受第二章启发的单线Class理解:
onlyascii = ''.join([s for s in data if ord(s) < 127])
onlymatch = ''.join([s for s in data if s in
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])
Working my way through Fluent Python (Ramalho) – highly recommended.
List comprehension one-ish-liners inspired by Chapter 2:
onlyascii = ''.join([s for s in data if ord(s) < 127])
onlymatch = ''.join([s for s in data if s in
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])