在Python中转义HTML的最简单方法是什么?

问题:在Python中转义HTML的最简单方法是什么?

cgi.escape似乎是一种可能的选择。它运作良好吗?有什么更好的东西吗?

cgi.escape seems like one possible choice. Does it work well? Is there something that is considered better?


回答 0

cgi.escape很好 它逃脱了:

  • <&lt;
  • >&gt;
  • &&amp;

对于所有HTML而言,这就足够了。

编辑:如果您有非ASCII字符,您还想转义,以便包含在使用不同编码的另一个编码文档中,如Craig所说,只需使用:

data.encode('ascii', 'xmlcharrefreplace')

不要忘了解码dataunicode第一,使用任何编码它编码的。

但是根据我的经验,如果您unicode从头开始一直都在工作,那么这种编码是没有用的。只需在文档头中指定的编码末尾进行编码(utf-8以实现最大兼容性)。

例:

>>> cgi.escape(u'<a>bá</a>').encode('ascii', 'xmlcharrefreplace')
'&lt;a&gt;b&#225;&lt;/a&gt;

另外值得一提的(感谢Greg)是额外的quote参数cgi.escape。将其设置为时Truecgi.escape还转义双引号字符("),因此您可以在XML / HTML属性中使用结果值。

编辑:请注意,cgi.escape已在Python 3.2中弃用,转而使用html.escape,它的功能相同,但quote默认情况下为True。

cgi.escape is fine. It escapes:

  • < to &lt;
  • > to &gt;
  • & to &amp;

That is enough for all HTML.

EDIT: If you have non-ascii chars you also want to escape, for inclusion in another encoded document that uses a different encoding, like Craig says, just use:

data.encode('ascii', 'xmlcharrefreplace')

Don’t forget to decode data to unicode first, using whatever encoding it was encoded.

However in my experience that kind of encoding is useless if you just work with unicode all the time from start. Just encode at the end to the encoding specified in the document header (utf-8 for maximum compatibility).

Example:

>>> cgi.escape(u'<a>bá</a>').encode('ascii', 'xmlcharrefreplace')
'&lt;a&gt;b&#225;&lt;/a&gt;

Also worth of note (thanks Greg) is the extra quote parameter cgi.escape takes. With it set to True, cgi.escape also escapes double quote chars (") so you can use the resulting value in a XML/HTML attribute.

EDIT: Note that cgi.escape has been deprecated in Python 3.2 in favor of html.escape, which does the same except that quote defaults to True.


回答 1

在Python 3.2中,新 html,引入模块,该模块用于从HTML标记转义保留字符。

它具有一个功能escape()

>>> import html
>>> html.escape('x > 2 && x < 7 single quote: \' double quote: "')
'x &gt; 2 &amp;&amp; x &lt; 7 single quote: &#x27; double quote: &quot;'

In Python 3.2 a new html module was introduced, which is used for escaping reserved characters from HTML markup.

It has one function escape():

>>> import html
>>> html.escape('x > 2 && x < 7 single quote: \' double quote: "')
'x &gt; 2 &amp;&amp; x &lt; 7 single quote: &#x27; double quote: &quot;'

回答 2

如果您希望在URL中转义HTML:

这可能不是OP想要的(问题并没有明确指出转义是在哪种上下文中使用的),但是Python的本机库urllib有一种方法可以转义需要安全包含在URL中的HTML实体。

以下是一个示例:

#!/usr/bin/python
from urllib import quote

x = '+<>^&'
print quote(x) # prints '%2B%3C%3E%5E%26'

在这里找到文件

If you wish to escape HTML in a URL:

This is probably NOT what the OP wanted (the question doesn’t clearly indicate in which context the escaping is meant to be used), but Python’s native library urllib has a method to escape HTML entities that need to be included in a URL safely.

The following is an example:

#!/usr/bin/python
from urllib import quote

x = '+<>^&'
print quote(x) # prints '%2B%3C%3E%5E%26'

Find docs here


回答 3

还有出色的markupsafe软件包

>>> from markupsafe import Markup, escape
>>> escape("<script>alert(document.cookie);</script>")
Markup(u'&lt;script&gt;alert(document.cookie);&lt;/script&gt;')

markupsafe程序包经过精心设计,并且可能是逃避转义的最通用,最Python化的方法,恕我直言,因为:

  1. return(Markup)是从unicode派生的类(即isinstance(escape('str'), unicode) == True
  2. 它可以正确处理unicode输入
  3. 它适用于Python(2.6、2.7、3.3和pypy)
  4. 它尊重对象(即具有__html__属性的对象)和模板重载(__html_format__)的自定义方法。

There is also the excellent markupsafe package.

>>> from markupsafe import Markup, escape
>>> escape("<script>alert(document.cookie);</script>")
Markup(u'&lt;script&gt;alert(document.cookie);&lt;/script&gt;')

The markupsafe package is well engineered, and probably the most versatile and Pythonic way to go about escaping, IMHO, because:

  1. the return (Markup) is a class derived from unicode (i.e. isinstance(escape('str'), unicode) == True
  2. it properly handles unicode input
  3. it works in Python (2.6, 2.7, 3.3, and pypy)
  4. it respects custom methods of objects (i.e. objects with a __html__ property) and template overloads (__html_format__).

回答 4

cgi.escape 从转义HTML标记和字符实体的有限意义上讲,应该可以逃脱HTML。

但是,您可能还必须考虑编码问题:如果要引用的HTML在特定的编码中包含非ASCII字符,那么还必须注意在引用时要合理地表示这些字符。也许您可以将它们转换为实体。否则,您应确保在“源” HTML和嵌入页面之间进行正确的编码转换,以避免损坏非ASCII字符。

cgi.escape should be good to escape HTML in the limited sense of escaping the HTML tags and character entities.

But you might have to also consider encoding issues: if the HTML you want to quote has non-ASCII characters in a particular encoding, then you would also have to take care that you represent those sensibly when quoting. Perhaps you could convert them to entities. Otherwise you should ensure that the correct encoding translations are done between the “source” HTML and the page it’s embedded in, to avoid corrupting the non-ASCII characters.


回答 5

没有库,纯python,可以安全地将文本转义为html文本:

text.replace('&', '&amp;').replace('>', '&gt;').replace('<', '&lt;'
        ).encode('ascii', 'xmlcharrefreplace')

No libraries, pure python, safely escapes text into html text:

text.replace('&', '&amp;').replace('>', '&gt;').replace('<', '&lt;'
        ).encode('ascii', 'xmlcharrefreplace')

回答 6

cgi.escape 扩展的

此版本进行了改进cgi.escape。它还保留空格和换行符。返回一个unicode字符串。

def escape_html(text):
    """escape strings for display in HTML"""
    return cgi.escape(text, quote=True).\
           replace(u'\n', u'<br />').\
           replace(u'\t', u'&emsp;').\
           replace(u'  ', u' &nbsp;')

例如

>>> escape_html('<foo>\nfoo\t"bar"')
u'&lt;foo&gt;<br />foo&emsp;&quot;bar&quot;'

cgi.escape extended

This version improves cgi.escape. It also preserves whitespace and newlines. Returns a unicode string.

def escape_html(text):
    """escape strings for display in HTML"""
    return cgi.escape(text, quote=True).\
           replace(u'\n', u'<br />').\
           replace(u'\t', u'&emsp;').\
           replace(u'  ', u' &nbsp;')

for example

>>> escape_html('<foo>\nfoo\t"bar"')
u'&lt;foo&gt;<br />foo&emsp;&quot;bar&quot;'

回答 7

不是最简单的方法,但仍然很简单。与cgi.escape模块的主要区别-如果您已经&amp;在文本中使用了它,它仍然可以正常工作。从评论中可以看到:

cgi.escape版本

def escape(s, quote=None):
    '''Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
is also translated.'''
    s = s.replace("&", "&amp;") # Must be done first!
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

正则表达式版本

QUOTE_PATTERN = r"""([&<>"'])(?!(amp|lt|gt|quot|#39);)"""
def escape(word):
    """
    Replaces special characters <>&"' to HTML-safe sequences. 
    With attention to already escaped characters.
    """
    replace_with = {
        '<': '&gt;',
        '>': '&lt;',
        '&': '&amp;',
        '"': '&quot;', # should be escaped in attributes
        "'": '&#39'    # should be escaped in attributes
    }
    quote_pattern = re.compile(QUOTE_PATTERN)
    return re.sub(quote_pattern, lambda x: replace_with[x.group(0)], word)

Not the easiest way, but still straightforward. The main difference from cgi.escape module – it still will work properly if you already have &amp; in your text. As you see from comments to it:

cgi.escape version

def escape(s, quote=None):
    '''Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
is also translated.'''
    s = s.replace("&", "&amp;") # Must be done first!
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

regex version

QUOTE_PATTERN = r"""([&<>"'])(?!(amp|lt|gt|quot|#39);)"""
def escape(word):
    """
    Replaces special characters <>&"' to HTML-safe sequences. 
    With attention to already escaped characters.
    """
    replace_with = {
        '<': '&gt;',
        '>': '&lt;',
        '&': '&amp;',
        '"': '&quot;', # should be escaped in attributes
        "'": '&#39'    # should be escaped in attributes
    }
    quote_pattern = re.compile(QUOTE_PATTERN)
    return re.sub(quote_pattern, lambda x: replace_with[x.group(0)], word)

回答 8

对于Python 2.7中的旧代码,可以通过BeautifulSoup4做到:

>>> bs4.dammit import EntitySubstitution
>>> esub = EntitySubstitution()
>>> esub.substitute_html("r&d")
'r&amp;d'

For legacy code in Python 2.7, can do it via BeautifulSoup4:

>>> bs4.dammit import EntitySubstitution
>>> esub = EntitySubstitution()
>>> esub.substitute_html("r&d")
'r&amp;d'