标签归档:diacritics

移除Python unicode字符串中的重音符号的最佳方法是什么?

问题:移除Python unicode字符串中的重音符号的最佳方法是什么?

我在Python中有一个Unicode字符串,我想删除所有的重音符号(变音符号)。

我在网上发现了一种用Java实现此目的的优雅方法:

  1. 将Unicode字符串转换为长规范化格式(带有单独的字母和变音符号)
  2. 删除Unicode类型为“变音符号”的所有字符。

我是否需要安装pyICU之类的库,还是仅使用python标准库就可以?那python 3呢?

重要说明:我想避免使用带有重音符号到非重音符号的显式映射的代码。

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the Web an elegant way to do this in Java:

  1. convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
  2. remove all the characters whose Unicode type is “diacritic”.

Do I need to install a library such as pyICU or is this possible with just the python standard library? And what about python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.


回答 0

Unidecode是正确的答案。它将所有unicode字符串音译为ASCII文本中最接近的可能表示形式。

例:

accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga'and is of type 'str'

Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.

Example:

accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga'and is of type 'str'

回答 1

这个怎么样:

import unicodedata
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')

这也适用于希腊字母:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>> 

字符类别 “锰”表示Nonspacing_Mark,这是类似于MiniQuark的答案unicodedata.combining(我没想到unicodedata.combining的,但它可能是更好的解决方案,因为它更明确)。

请记住,这些操作可能会大大改变文本的含义。口音,Umlauts等不是“装饰”。

How about this:

import unicodedata
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')

This works on greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>> 

The character category “Mn” stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark’s answer (I didn’t think of unicodedata.combining, but it is probably the better solution, because it’s more explicit).

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not “decoration”.


回答 2

我刚刚在网上找到了这个答案:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

它可以正常工作(例如,对于法语),但是我认为第二步(删除重音符号)比丢弃非ASCII字符要好,因为这对于某些语言(例如希腊文)会失败。最好的解决方案可能是显式删除标记为变音符号的unicode字符。

编辑:这可以解决问题:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

unicodedata.combining(c)如果该字符c可以与前面的字符组合,则返回true ,这主要是如果它是一个变音符。

编辑2remove_accents需要一个unicode字符串,而不是字节字符串。如果您有字节字符串,则必须将其解码为一个unicode字符串,如下所示:

encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café"  # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)

I just found this answer on the Web:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.

Edit: this does the trick:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

unicodedata.combining(c) will return true if the character c can be combined with the preceding character, that is mainly if it’s a diacritic.

Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café"  # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)

回答 3

实际上,我正在开发与项目兼容的python 2.6、2.7和3.4,并且必须从免费用户条目中创建ID。

多亏了您,我创建了一个可以实现奇迹的功能。

import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError): # unicode is a default on python 3 
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text

结果:

text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'

Actually I work on project compatible python 2.6, 2.7 and 3.4 and I have to create IDs from free user entries.

Thanks to you, I have created this function that works wonders.

import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError): # unicode is a default on python 3 
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text

result:

text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'

回答 4

这不仅处理重音,而且还处理“笔画”(如ø等):

import unicodedata as ud

def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass  # removing "WITH ..." produced an invalid name
    return char

这是我能想到的最优雅的方式(alexis在此页的评论中已经提到),尽管我认为这确实不是很优雅。实际上,正如注释中所指出的那样,这更像是一种黑客,因为Unicode名称是–实际上只是名称,它们不能保证其一致性或任何形式。

由于它们的Unicode名称中不包含“ WITH”,因此仍有一些特殊的字母无法对此进行处理,例如转弯和倒转字母。无论如何,这取决于您想做什么。有时我需要重音符号来实现字典的排序顺序。

编辑说明:

合并了注释中的建议(处理查找错误,Python-3代码)。

This handles not only accents, but also “strokes” (as in ø etc.):

import unicodedata as ud

def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass  # removing "WITH ..." produced an invalid name
    return char

This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don’t think it is very elegant indeed. In fact, it’s more of a hack, as pointed out in comments, since Unicode names are – really just names, they give no guarantee to be consistent or anything.

There are still special letters that are not handled by this, such as turned and inverted letters, since their unicode name does not contain ‘WITH’. It depends on what you want to do anyway. I sometimes needed accent stripping for achieving dictionary sort order.

EDIT NOTE:

Incorporated suggestions from the comments (handling lookup errors, Python-3 code).


回答 5

回应@MiniQuark的回答:

我试图读取一个半法语的csv文件(包含重音符号)以及一些最终会变成整数和浮点数的字符串。作为测试,我创建了一个如下所示的test.txt文件:

蒙特利尔,于伯,12.89,梅尔,弗朗索瓦,诺尔,889

我必须包括行23使其起作用(在python票证中找到),并包含@Jabba的注释:

import sys 
reload(sys) 
sys.setdefaultencoding("utf-8")
import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)

结果:

Montreal
uber
12.89
Mere
Francoise
noel
889

(注意:我在Mac OS X 10.8.4上并使用Python 2.7.3)

In response to @MiniQuark’s answer:

I was trying to read in a csv file that was half-French (containing accents) and also some strings which would eventually become integers and floats. As a test, I created a test.txt file that looked like this:

Montréal, über, 12.89, Mère, Françoise, noël, 889

I had to include lines 2 and 3 to get it to work (which I found in a python ticket), as well as incorporate @Jabba’s comment:

import sys 
reload(sys) 
sys.setdefaultencoding("utf-8")
import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)

The result:

Montreal
uber
12.89
Mere
Francoise
noel
889

(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)


回答 6

gensim.utils.deaccent(文本)Gensim -人类主题建模

'Sef chomutovskych komunistu dostal postou bily prasek'

另一个解决方案是unidecode

需要注意的是,用建议的解决方案unicodedata通常只在某些字符去掉口音(例如,它变成'ł''',而不是进入'l')。

gensim.utils.deaccent(text) from Gensim – topic modelling for humans:

'Sef chomutovskych komunistu dostal postou bily prasek'

Another solution is unidecode.

Note that the suggested solution with unicodedata typically removes accents only in some character (e.g. it turns 'ł' into '', rather than into 'l').


回答 7

一些语言结合了变音符号作为语言字母和重音符号来指定重音。

我认为更明确地指定要去除的折光度数更安全:

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))

Some languages have combining diacritics as language letters and accent diacritics to specify accent.

I think it is more safe to specify explicitly what diactrics you want to strip:

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))