How can I remove non-ASCII characters but leave periods and spaces using Python? - Python 实用宝典

Question: How can I remove non-ASCII characters but leave periods and spaces using Python?

I’m working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I’m stripping those too. Here’s the code:

def onlyascii(char):
    if ord(char) < 48 or ord(char) > 127: return ''
    else: return char

def get_my_string(file_path):
    f=open(file_path,'r')
    data=f.read()
    f.close()
    filtered_data=filter(onlyascii, data)
    filtered_data = filtered_data.lower()
    return filtered_data

How should I modify onlyascii() to leave spaces and periods? I imagine it’s not too complicated but I can’t figure it out.

Answer 0

You can filter out every non-printable character from the string using string.printable, like this (Python 2 session shown):

>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'

string.printable on my machine contains:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c

EDIT: On Python 3, filter returns an iterator rather than a string. To get a string back, join the result:

''.join(filter(lambda x: x in printable, s))
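For the file-cleaning use case in the question, the same idea can be wrapped in a small helper. This is only a minimal sketch assuming Python 3; the name keep_printable is hypothetical and not part of the answer above.

```python
import string

# A set gives constant-time membership tests for each character.
PRINTABLE = set(string.printable)

def keep_printable(text):
    """Return text with every character outside string.printable removed.

    keep_printable is a hypothetical helper name used for illustration.
    """
    return ''.join(ch for ch in text if ch in PRINTABLE)

s = "some\x00string. with\x15 funny characters"
cleaned = keep_printable(s)
print(cleaned)
```

Spaces and periods are both in string.printable, so they survive the filter.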

Answer 1

An easy way to change to a different codec is to use encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:

    >>> s = u'Good bye in Swedish is Hej d\xe5'
    >>> s = s.encode('ascii', errors='ignore')
    >>> print s
    Good bye in Swedish is Hej d

Edit:

Python3: str -> bytes -> str

>>> "Hej då".encode("ascii", errors="ignore").decode()
'Hej d'

Python2: unicode -> str -> unicode

>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'

Python2: str -> unicode -> str (decode and encode in reverse order)

>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
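The Python 3 round trip above can be packaged as a function. This is a sketch under the assumption that the input is already a str; ascii_only is a hypothetical name. Note that this approach keeps ASCII control characters such as \x00, because they are valid ASCII.

```python
def ascii_only(text):
    """Drop every character that cannot be encoded as ASCII.

    ascii_only is a hypothetical helper name used for illustration.
    """
    # str -> bytes (unencodable characters are skipped) -> str
    return text.encode("ascii", errors="ignore").decode("ascii")

result = ascii_only("Hej då. Good bye.")
print(result)
```

Spaces and periods are ASCII, so they pass through unchanged.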

Answer 2

According to @artfulrobot, this should be faster than filter and lambda:

re.sub(r'[^\x00-\x7f]', r'', your_non_ascii_string)

For more examples, see the related question "Replace non-ASCII characters with a single space".
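A self-contained version of the regex approach might look like the sketch below, assuming Python 3. The names NON_ASCII and strip_non_ascii are hypothetical, and the optional replacement argument is an addition for illustration.

```python
import re

# Matches any single character outside the ASCII range 0x00-0x7f.
NON_ASCII = re.compile(r'[^\x00-\x7f]')

def strip_non_ascii(text, replacement=''):
    """Replace each non-ASCII character in text with replacement.

    strip_non_ascii is a hypothetical helper name used for illustration.
    """
    return NON_ASCII.sub(replacement, text)

removed = strip_non_ascii("Hej då. Café au lait.")
spaced = strip_non_ascii("Hej då.", replacement=' ')
print(removed)
print(spaced)
```

Compiling the pattern once avoids recompiling it on each call when many strings are processed.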

Answer 3

Your question is ambiguous; the first two sentences taken together imply that you believe that space and “period” are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&'()*+,-./ but includes several others e.g. []{}.

Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?

Please note that your code reads the whole input file as a single string, and your comment (“great solution”) to another answer implies that you don’t care about newlines in your data. If your file contains two lines like this:

this is line 1
this is line 2

the result would be 'this is line 1this is line 2' … is that what you really want?

A greater solution would include:

a better name for the filter function than onlyascii

recognition that a filter function merely needs to return a truthy value if the argument is to be retained:

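def filter_func(char):
    # Keep newlines and printable ASCII (space through tilde).
    return char == '\n' or 32 <= ord(char) <= 126
# and later (the join makes this work on Python 3 as well as Python 2):
filtered_data = ''.join(filter(filter_func, data)).lower()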
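Putting this together with the file-reading function from the question gives something like the sketch below. It assumes Python 3 and a UTF-8 encoded input file (the question does not state the encoding), and keep_char is a hypothetical name.

```python
def keep_char(char):
    """Return True for a newline or a printable ASCII character."""
    return char == '\n' or 32 <= ord(char) <= 126

def get_my_string(file_path):
    # Assumption: the file is UTF-8; adjust encoding to match the real data.
    with open(file_path, 'r', encoding='utf-8') as f:
        data = f.read()
    # filter() returns an iterator on Python 3, so join it back into a str.
    return ''.join(filter(keep_char, data)).lower()
```

Because newlines are kept, the two-line file above comes back as two lines rather than being run together.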

Answer 4

You may use the following code to remove non-ASCII characters (for example, non-English letters):

import re
# Named text rather than str to avoid shadowing the built-in str type.
text = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]', r'', text)
print(result)

This will return

123456790 ABC#%? .()

Answer 5

If you want printable ASCII characters, you should probably correct your code to:

if ord(char) < 32 or ord(char) > 126: return ''

This is equivalent to string.printable (see the answer from @jterrace), except that it excludes tabs, newlines and the other whitespace control characters ('\t', '\n', '\x0b', '\x0c' and '\r'). Note that it does not correspond to the range in your question.
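The difference between the two character sets can be checked directly. A small sketch, assuming Python 3; the variable names are illustrative only.

```python
import string

# Characters accepted by the 32..126 range check.
in_range = {chr(code) for code in range(32, 127)}

# Characters in string.printable that the range check rejects.
extra = set(string.printable) - in_range
print(sorted(extra))  # ['\t', '\n', '\x0b', '\x0c', '\r']
```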

Answer 6

Working my way through Fluent Python (Ramalho) – highly recommended. List comprehension one-ish-liners inspired by Chapter 2:

onlyascii = ''.join([s for s in data if ord(s) < 128])  # ASCII is 0..127
onlymatch = ''.join([s for s in data if s in
              'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])
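The allow-list variant extends naturally to the question's requirement of keeping spaces and periods. A sketch assuming Python 3; ALLOWED and keep_allowed are hypothetical names, and keeping digits is an assumption about what the asker wants.

```python
import string

# Letters, digits, space and period; everything else is dropped.
ALLOWED = set(string.ascii_letters + string.digits + ' .')

def keep_allowed(data):
    """Return data with only the allow-listed characters retained.

    keep_allowed is a hypothetical helper name used for illustration.
    """
    return ''.join(ch for ch in data if ch in ALLOWED)

kept = keep_allowed("Hej då, v2.0 (beta)!")
print(kept)
```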
