Tag archive: html

How do I display the full (non-truncated) dataframe information in HTML when converting from a pandas dataframe to HTML?

Question: How do I display the full (non-truncated) dataframe information in HTML when converting from a pandas dataframe to HTML?

I converted a pandas dataframe to an html output using the DataFrame.to_html function. When I save this to a separate html file, the file shows truncated output.

For example, in my TEXT column,

df.head(1) will show

The film was an excellent effort…

instead of

The film was an excellent effort in deconstructing the complex social sentiments that prevailed during this period.

This rendition is fine in the case of a screen-friendly format of a massive pandas dataframe, but I need an html file that will show complete tabular data contained in the dataframe, that is, something that will show the latter text element rather than the former text snippet.

How would I be able to show the complete, non-truncated text data for each element in my TEXT column in the html version of the information? I would imagine that the html table would have to display long cells to show the complete data, but as far as I understand, only column-width parameters can be passed into the DataFrame.to_html function.


Answer 0

Set the display.max_colwidth option to -1:

pd.set_option('display.max_colwidth', -1)

set_option docs

For example, in IPython, we see that the information is truncated to 50 characters by default; anything in excess is ellipsized.

If you set the display.max_colwidth option, the information will be displayed fully.
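
A minimal sketch of the behavior (the column name and text are made up for illustration):

import pandas as pd

df = pd.DataFrame({'TEXT': ['The film was an excellent effort in deconstructing '
                            'the complex social sentiments that prevailed during this period.']})

print(df.head(1))  # the TEXT column is ellipsized at 50 characters by default

pd.set_option('display.max_colwidth', -1)  # on pandas >= 1.0, use None instead
print(df.head(1))       # the full text is shown
df.to_html('out.html')  # the HTML output is no longer truncated either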


Answer 1

pd.set_option('display.max_columns', None)  

Passing None as the second argument makes it show the columns in full.


Answer 2

While pd.set_option('display.max_columns', None) sets the maximum number of columns shown, the option pd.set_option('display.max_colwidth', -1) sets the maximum width of each single field.

For my purposes, I wrote a small helper function to fully print huge data frames without affecting the rest of the code; it also reformats float numbers and sets the virtual display width. You may adapt it for your use cases.

def print_full(x):
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', 2000)
    pd.set_option('display.float_format', '{:20,.2f}'.format)
    pd.set_option('display.max_colwidth', None)
    print(x)
    pd.reset_option('display.max_rows')
    pd.reset_option('display.max_columns')
    pd.reset_option('display.width')
    pd.reset_option('display.float_format')
    pd.reset_option('display.max_colwidth')

Answer 3

For those looking to do this in dask: I could not find a similar option in dask, but simply setting the pandas option in the same notebook works for dask too.

import pandas as pd
import dask.dataframe as dd
pd.set_option('display.max_colwidth', -1)  # disables truncation for pandas; dask's display appears to respect it as well

train_data = dd.read_csv('./data/train.csv')
train_data.head(5)

Answer 4

The following code results in the warning below:

pd.set_option('display.max_colwidth', -1)

FutureWarning: Passing a negative integer is deprecated in version 1.0 and will not be supported in future version. Instead, use None to not limit the column width.

Instead, use:

pd.set_option('display.max_colwidth', None)

This accomplishes the task and complies with versions of pandas following version 1.0.
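
If you only need the full width for a single export, a scoped alternative is pandas' option_context context manager (a sketch; df stands for your dataframe):

import pandas as pd

# Temporarily lift the column-width limit for just this to_html call
with pd.option_context('display.max_colwidth', None):
    html = df.to_html()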


Strip HTML from strings in Python

Question: Strip HTML from strings in Python

from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
  print line

When printing a line in an HTML file, I’m trying to find a way to only show the contents of each HTML element and not the formatting itself. If it finds '<a href="whatever.com">some text</a>', it will only print ‘some text’, '<b>hello</b>' prints ‘hello’, etc. How would one go about doing this?


Answer 0

I always used this function to strip HTML tags, as it requires only the Python stdlib:

For Python 3:

from io import StringIO
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs= True
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

For Python 2:

from HTMLParser import HTMLParser
from StringIO import StringIO

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
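
Example usage with either version (output shown in the comments):

print(strip_tags('<a href="whatever.com">some text</a>'))  # some text
print(strip_tags('<b>hello</b>'))                          # hello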

Answer 1

I haven’t thought much about the cases it will miss, but you can do a simple regex:

re.sub('<[^<]+?>', '', text)

For those that don’t understand regex, this searches for a string <...> where the inner content is made of one or more (+) characters that aren’t <. The ? means that it will match the smallest string it can find. For example, given <p>Hello</p>, it will match <p> and </p> separately thanks to the ?. Without it, it would match the entire string <..Hello..>.

If a non-tag < appears in the HTML (e.g. 2 < 3), it should be written as an escape sequence &... anyway, so the [^<] restriction may be unnecessary.
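
For example:

import re

print(re.sub('<[^<]+?>', '', '<b>hello</b> <a href="whatever.com">some text</a>'))
# hello some text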


Answer 2

You can use BeautifulSoup’s get_text() feature.

from bs4 import BeautifulSoup

html_str = '''
<td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.com">I am waiting....</a>
</td>
'''
soup = BeautifulSoup(html_str)

print(soup.get_text()) 
#or via attribute of Soup Object: print(soup.text)

It is advisable to explicitly specify the parser, for example as BeautifulSoup(html_str, features="html.parser"), for the output to be reproducible.


Answer 3

Short version!

import re, cgi
tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

# Remove well-formed tags, fixing mistakes by legitimate users
no_tags = tag_re.sub('', user_input)

# Clean up anything else by escaping
ready_for_web = cgi.escape(no_tags)
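
Note that cgi.escape was deprecated for years and removed in Python 3.8; on modern Python the same two-step idea can be sketched with html.escape as the stdlib replacement:

import html
import re

tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

no_tags = tag_re.sub('', user_input)
ready_for_web = html.escape(no_tags)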

Regex source: MarkupSafe. Their version handles HTML entities too, while this quick one doesn’t.

Why can’t I just strip the tags and leave it?

It’s one thing to keep people from <i>italicizing</i> things, without leaving stray <i>s floating around. But it’s another to take arbitrary input and make it completely harmless. Most of the techniques on this page will leave things like unclosed comments (<!--) and angle-brackets that aren’t part of tags (blah <<<><blah) intact. The HTMLParser version can even leave complete tags in, if they’re inside an unclosed comment.

What if your template is {{ firstname }} {{ lastname }}? firstname = '<a' and lastname = 'href="http://evil.com/">' will be let through by every tag stripper on this page (except @Medeiros!), because they’re not complete tags on their own. Stripping out normal HTML tags is not enough.

Django’s strip_tags, an improved (see next heading) version of the top answer to this question, gives the following warning:

Absolutely NO guarantee is provided about the resulting string being HTML safe. So NEVER mark safe the result of a strip_tags call without escaping it first, for example with escape().

Follow their advice!

To strip tags with HTMLParser, you have to run it multiple times.

It’s easy to circumvent the top answer to this question.

Look at this string (source and discussion):

<img<!-- --> src=x onerror=alert(1);//><!-- -->

The first time HTMLParser sees it, it can’t tell that the <img...> is a tag. It looks broken, so HTMLParser doesn’t get rid of it. It only takes out the <!-- comments -->, leaving you with

<img src=x onerror=alert(1);//>

This problem was disclosed to the Django project in March, 2014. Their old strip_tags was essentially the same as the top answer to this question. Their new version basically runs it in a loop until running it again doesn’t change the string:

# _strip_once runs HTMLParser once, pulling out just the text of all the nodes.

def strip_tags(value):
    """Returns the given HTML with all tags stripped."""
    # Note: in typical case this loop executes _strip_once once. Loop condition
    # is redundant, but helps to reduce number of executions of _strip_once.
    while '<' in value and '>' in value:
        new_value = _strip_once(value)
        if len(new_value) >= len(value):
            # _strip_once was not able to detect more tags
            break
        value = new_value
    return value

Of course, none of this is an issue if you always escape the result of strip_tags().

Update 19 March, 2015: There was a bug in Django versions before 1.4.20, 1.6.11, 1.7.7, and 1.8c1. These versions could enter an infinite loop in the strip_tags() function. The fixed version is reproduced above. More details here.

Good things to copy or use

My example code doesn’t handle HTML entities – the Django and MarkupSafe packaged versions do.

My example code is pulled from the excellent MarkupSafe library for cross-site scripting prevention. It’s convenient and fast (with C speedups to its native Python version). It’s included in Google App Engine, and used by Jinja2 (2.7 and up), Mako, Pylons, and more. It works easily with Django templates from Django 1.7.

Django’s strip_tags and other html utilities from a recent version are good, but I find them less convenient than MarkupSafe. They’re pretty self-contained, you could copy what you need from this file.

If you need to strip almost all tags, the Bleach library is good. You can have it enforce rules like “my users can italicize things, but they can’t make iframes.”
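
A sketch of such a rule with Bleach's clean() (the tags/strip arguments exist, though argument types and defaults vary a bit between Bleach versions):

import bleach

cleaned = bleach.clean('<i>fine</i> <iframe src="http://evil.com"></iframe>',
                       tags=['i'],  # allow only <i>
                       strip=True)  # drop disallowed tags instead of escaping them
print(cleaned)  # <i>fine</i> (the iframe markup is gone)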

Understand the properties of your tag stripper! Run fuzz tests on it! Here is the code I used to do the research for this answer.

sheepish note – The question itself is about printing to the console, but this is the top Google result for “python strip html from string”, so that’s why this answer is 99% about the web.


Answer 4

I needed a way to strip tags and decode HTML entities to plain text. The following solution is based on Eloff’s answer (which I couldn’t use because it strips entities).

from HTMLParser import HTMLParser
import htmlentitydefs

class HTMLTextExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.result = [ ]

    def handle_data(self, d):
        self.result.append(d)

    def handle_charref(self, number):
        codepoint = int(number[1:], 16) if number[0] in (u'x', u'X') else int(number)
        self.result.append(unichr(codepoint))

    def handle_entityref(self, name):
        codepoint = htmlentitydefs.name2codepoint[name]
        self.result.append(unichr(codepoint))

    def get_text(self):
        return u''.join(self.result)

def html_to_text(html):
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()

A quick test:

html = u'<a href="#">Demo <em>(&not; \u0394&#x03b7;&#956;&#x03CE;)</em></a>'
print repr(html_to_text(html))

Result:

u'Demo (\xac \u0394\u03b7\u03bc\u03ce)'

Error handling:

  • Invalid HTML structure may cause an HTMLParseError.
  • Invalid named HTML entities (such as &apos;, which is valid in XML and XHTML, but not plain HTML) will cause a ValueError exception.
  • Numeric HTML entities specifying code points outside the Unicode range acceptable by Python (such as, on some systems, characters outside the Basic Multilingual Plane) will cause a ValueError exception.

Security note: Do not confuse HTML stripping (converting HTML into plain text) with HTML sanitizing (converting plain text into HTML). This answer will remove HTML and decode entities into plain text – that does not make the result safe to use in a HTML context.

Example: &lt;script&gt;alert("Hello");&lt;/script&gt; will be converted to <script>alert("Hello");</script>, which is 100% correct behavior, but obviously not sufficient if the resulting plain text is inserted as-is into a HTML page.

The rule is not hard: Any time you insert a plain-text string into HTML output, you should always HTML escape it (using cgi.escape(s, True)), even if you “know” that it doesn’t contain HTML (e.g. because you stripped HTML content).

(However, the OP asked about printing the result to the console, in which case no HTML escaping is needed.)

Python 3.4+ version: (with doctest!)

import html.parser

class HTMLTextExtractor(html.parser.HTMLParser):
    def __init__(self):
        super(HTMLTextExtractor, self).__init__()
        self.result = [ ]

    def handle_data(self, d):
        self.result.append(d)

    def get_text(self):
        return ''.join(self.result)

def html_to_text(html):
    """Converts HTML to plain text (stripping tags and converting entities).
    >>> html_to_text('<a href="#">Demo<!--...--> <em>(&not; \u0394&#x03b7;&#956;&#x03CE;)</em></a>')
    'Demo (\xac \u0394\u03b7\u03bc\u03ce)'

    "Plain text" doesn't mean result can safely be used as-is in HTML.
    >>> html_to_text('&lt;script&gt;alert("Hello");&lt;/script&gt;')
    '<script>alert("Hello");</script>'

    Always use html.escape to sanitize text before using in an HTML context!

    HTMLParser will do its best to make sense of invalid HTML.
    >>> html_to_text('x < y &lt z <!--b')
    'x < y < z '

    Unrecognized named entities are included as-is. '&apos;' is recognized,
    despite being XML only.
    >>> html_to_text('&nosuchentity; &apos; ')
    "&nosuchentity; ' "
    """
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()

Note that HTMLParser has improved in Python 3 (meaning less code and better error handling).
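
To run the embedded doctests, one option is the standard doctest idiom at the bottom of the file:

if __name__ == '__main__':
    import doctest
    doctest.testmod()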


Answer 5

There’s a simple way to do this:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out
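
For example:

print(remove_html_markup('<a href="whatever.com">some text</a>'))  # some text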

The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS – If you’re interested in the class (about smart debugging with Python), here is a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It’s free!

You’re welcome! :)


Answer 6

If you need to preserve HTML entities (e.g. &amp;), I added a handle_entityref method to Eloff’s answer.

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)
    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Answer 7

If you want to strip all HTML tags the easiest way I found is using BeautifulSoup:

from bs4 import BeautifulSoup  # Or from BeautifulSoup import BeautifulSoup

def stripHtmlTags(htmlTxt):
    if htmlTxt is None:
        return None
    else:
        return ''.join(BeautifulSoup(htmlTxt).findAll(text=True))

I tried the code of the accepted answer but I was getting “RuntimeError: maximum recursion depth exceeded”, which didn’t happen with the above block of code.


Answer 8

An lxml.html-based solution (lxml is a native library and can be more performant than a pure python solution).

Remove ALL tags

from lxml import html


## from file-like object or URL
tree = html.parse(file_like_object_or_url)

## from string
tree = html.fromstring('safe <script>unsafe</script> safe')

print(tree.text_content().strip())

### OUTPUT: 'safe unsafe safe'

Remove ALL tags with pre-sanitizing HTML (dropping some tags)

from lxml import html
from lxml.html.clean import clean_html

tree = html.fromstring("""<script>dangerous</script><span class="item-summary">
                            Detailed answers to any questions you might have
                        </span>""")

## text only
print(clean_html(tree).text_content().strip())

### OUTPUT: 'Detailed answers to any questions you might have'

Also see http://lxml.de/lxmlhtml.html#cleaning-up-html for what exactly the lxml.cleaner does.

If you need more control over what exactly is sanitized before converting to text, then you might want to use the lxml Cleaner explicitly by passing the options you want in the constructor, e.g.:

from lxml.html.clean import Cleaner

cleaner = Cleaner(page_structure=True,
                  meta=True,
                  embedded=True,
                  links=True,
                  style=True,
                  processing_instructions=True,
                  inline_style=True,
                  scripts=True,
                  javascript=True,
                  comments=True,
                  frames=True,
                  forms=True,
                  annoying_tags=True,
                  remove_unknown_tags=True,
                  safe_attrs_only=True,
                  safe_attrs=frozenset(['src','color', 'href', 'title', 'class', 'name', 'id']),
                  remove_tags=('span', 'font', 'div')
                  )
sanitized_html = cleaner.clean_html(unsafe_html)

If you need more control over how the plain text is generated, then instead of text_content() you can use lxml.etree.tostring:

from lxml.etree import tostring

plain_bytes = tostring(tree, method='text', encoding='utf-8')
print(plain_bytes.decode('utf-8'))


Answer 9

Here is a simple solution that strips HTML tags and decodes HTML entities based on the amazingly fast lxml library:

from lxml import html

def strip_html(s):
    return str(html.fromstring(s).text_content())

strip_html('Ein <a href="">sch&ouml;ner</a> Text.')  # Output: Ein schöner Text.

Answer 10

The Beautiful Soup package does this immediately for you.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
text = soup.get_text()
print(text)

Answer 11

Here’s my solution for python 3.

import html
import re

def html_to_txt(html_text):
    ## unescape html
    txt = html.unescape(html_text)
    tags = re.findall("<[^>]+>",txt)
    print("found tags: ")
    print(tags)
    for tag in tags:
        txt=txt.replace(tag,'')
    return txt

Not sure if it is perfect, but solved my use case and seems simple.


Answer 12

You can use either a different HTML parser (like lxml, or Beautiful Soup) — one that offers functions to extract just text. Or, you can run a regex on your line string that strips out the tags. See Python docs for more.


Answer 13

I have used Eloff’s answer successfully for Python 3.1 [many thanks!].

I upgraded to Python 3.2.3, and ran into errors.

The solution, provided here thanks to the responder Thomas K, is to insert super().__init__() into the following code:

def __init__(self):
    self.reset()
    self.fed = []

… in order to make it look like this:

def __init__(self):
    super().__init__()
    self.reset()
    self.fed = []

… and it will work for Python 3.2.3.

Again, thanks to Thomas K for the fix and for Eloff’s original code provided above!


Answer 14

You can write your own function:

def StripTags(text):
    finished = 0
    while not finished:
        finished = 1
        start = text.find("<")
        if start >= 0:
            stop = text[start:].find(">")
            if stop >= 0:
                text = text[:start] + text[start+stop+1:]
                finished = 0
    return text

Answer 15

The HTMLParser-based solutions are all breakable if they run only once:

html_to_text('<<b>script>alert("hacked")<</b>/script>')

results in:

<script>alert("hacked")</script>

which is what you intend to prevent. If you use an HTML parser, keep running it until no more tags are stripped:

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
        self.containstags = False

    def handle_starttag(self, tag, attrs):
        self.containstags = True

    def handle_data(self, d):
        self.fed.append(d)

    def has_tags(self):
        return self.containstags

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    must_filtered = True
    while ( must_filtered ):
        s = MLStripper()
        s.feed(html)
        html = s.get_data()
        must_filtered = s.has_tags()
    return html

Answer 16

This is a quick fix and can be optimized further, but it works fine. This code replaces all non-empty tags with "" and strips all HTML tags from a given input text. You can run it as ./file.py inputfile outputfile.

#!/usr/bin/python
import sys

def replace(strng, replaceText):
    rpl = 0
    while rpl > -1:
        rpl = strng.find(replaceText)
        if rpl != -1:
            strng = strng[0:rpl] + strng[rpl + len(replaceText):]
    return strng


lessThanPos = -1
count = 0
listOf = []

try:
    # write file
    writeto = open(sys.argv[2], 'w')

    # read file and store it in a list
    f = open(sys.argv[1], 'r')
    for readLine in f.readlines():
        listOf.append(readLine)
    f.close()

    # remove all tags
    for line in listOf:
        count = 0
        lessThanPos = -1
        lineTemp = line

        for char in lineTemp:
            if char == "<":
                lessThanPos = count
            if char == ">":
                if lessThanPos > -1:
                    if line[lessThanPos:count + 1] != '<>':
                        lineTemp = replace(lineTemp, line[lessThanPos:count + 1])
                        lessThanPos = -1
            count = count + 1
        lineTemp = lineTemp.replace("&lt", "<")
        lineTemp = lineTemp.replace("&gt", ">")
        writeto.write(lineTemp)
    writeto.close()
    print "Write To --- >", sys.argv[2]
except:
    print "Help: invalid arguments or exception"
    print "Usage : ", sys.argv[0], " inputfile outputfile"

Answer 17

A Python 3 adaptation of søren-løvborg’s answer:

from html.parser import HTMLParser
from html.entities import html5

class HTMLTextExtractor(HTMLParser):
    """ Adaptation of http://stackoverflow.com/a/7778368/196732 """
    def __init__(self):
        # convert_charrefs=False so the handle_charref/handle_entityref
        # callbacks below are actually invoked (Python 3.5+ converts
        # character references automatically by default)
        super().__init__(convert_charrefs=False)
        self.result = []

    def handle_data(self, d):
        self.result.append(d)

    def handle_charref(self, number):
        codepoint = int(number[1:], 16) if number[0] in (u'x', u'X') else int(number)
        self.result.append(chr(codepoint))

    def handle_entityref(self, name):
        # html5 maps entity names (with trailing semicolon) to their characters
        if name + ';' in html5:
            self.result.append(html5[name + ';'])

    def get_text(self):
        return u''.join(self.result)

def html_to_text(html):
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()

Answer 18

For one project, I needed to strip HTML, but also CSS and JS. Thus, I made a variation of Eloff’s answer:

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.fed = []
        self.css = False
    def handle_starttag(self, tag, attrs):
        if tag == "style" or tag == "script":
            self.css = True
    def handle_endtag(self, tag):
        if tag == "style" or tag == "script":
            self.css = False
    def handle_data(self, d):
        if not self.css:
            self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Answer 19

Here’s a solution similar to the currently accepted answer (https://stackoverflow.com/a/925630/95989), except that it uses the internal HTMLParser class directly (i.e. no subclassing), thereby making it significantly more terse:

from html.parser import HTMLParser

def strip_html(text):
    parts = []
    parser = HTMLParser()
    parser.handle_data = parts.append
    parser.feed(text)
    return ''.join(parts)
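
For example:

print(strip_html('<b>hello</b> <i>world</i>'))  # hello world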

Answer 20

I’m parsing Github readmes and I find that the following really works well:

import re
import lxml.html

def strip_markdown(x):
    links_sub = re.sub(r'\[(.+)\]\([^\)]+\)', r'\1', x)
    bold_sub = re.sub(r'\*\*([^*]+)\*\*', r'\1', links_sub)
    emph_sub = re.sub(r'\*([^*]+)\*', r'\1', bold_sub)
    return emph_sub

def strip_html(x):
    return lxml.html.fromstring(x).text_content() if x else ''

And then

readme = """<img src="https://raw.githubusercontent.com/kootenpv/sky/master/resources/skylogo.png" />

            sky is a web scraping framework, implemented with the latest python versions in mind (3.4+). 
            It uses the asynchronous `asyncio` framework, as well as many popular modules 
            and extensions.

            Most importantly, it aims for **next generation** web crawling where machine intelligence 
            is used to speed up the development/maintainance/reliability of crawling.

            It mainly does this by considering the user to be interested in content 
            from *domains*, not just a collection of *single pages*
            ([templating approach](#templating-approach))."""

strip_markdown(strip_html(readme))

Removes all markdown and html correctly.


Answer 21

Using BeautifulSoup, html2text, or the code from @Eloff, most of the time some HTML elements or JavaScript code remain…

So you can use a combination of these libraries and delete markdown formatting (Python 3):

import re
import html2text
from bs4 import BeautifulSoup
def html2Text(html):
    def removeMarkdown(text):
        for current in ["^[ #*]{2,30}", "^[ ]{0,30}\d\\\.", "^[ ]{0,30}\d\."]:
            markdown = re.compile(current, flags=re.MULTILINE)
            text = markdown.sub(" ", text)
        return text
    def removeAngular(text):
        angular = re.compile("[{][|].{2,40}[|][}]|[{][*].{2,40}[*][}]|[{][{].{2,40}[}][}]|\[\[.{2,40}\]\]")
        text = angular.sub(" ", text)
        return text
    h = html2text.HTML2Text()
    h.images_to_alt = True
    h.ignore_links = True
    h.ignore_emphasis = False
    h.skip_internal_links = True
    text = h.handle(html)
    soup = BeautifulSoup(text, "html.parser")
    text = soup.text
    text = removeAngular(text)
    text = removeMarkdown(text)
    return text

It works well for me but it can be enhanced, of course…


Answer 22

Simple code! This will remove all kinds of tags and whatever is inside the angle brackets.

def rm(s):
    start = False
    end = False
    s = ' ' + s
    for i in range(len(s) - 1):
        if i < len(s):
            if start != False:
                if s[i] == '>':
                    end = i
                    s = s[:start] + s[end + 1:]
                    start = end = False
            else:
                if s[i] == '<':
                    start = i
    if s.count('<') > 0:
        return rm(s)  # recurse until no tags remain (was self.rm(s), a bug outside a class)
    else:
        s = s.replace('&nbsp;', ' ')
        return s

But it won’t give the full result if the text itself contains < or > symbols.


Answer 23

# This is a regex solution.
import re
def removeHtml(html):
  if not html: return html
  # Remove comments first
  innerText = re.compile('<!--[\s\S]*?-->').sub('',html)
  while innerText.find('>')>=0: # Loop through nested Tags
    text = re.compile('<[^<>]+?>').sub('',innerText)
    if text == innerText:
      break
    innerText = text

  return innerText.strip()

Answer 24

This method works flawlessly for me and requires no additional installations:

import re
import htmlentitydefs

def convertentity(m):
    if m.group(1) == '#':
        try:
            return unichr(int(m.group(2)))
        except ValueError:
            return '&#%s;' % m.group(2)
    try:
        return htmlentitydefs.entitydefs[m.group(2)]
    except KeyError:
        return '&%s;' % m.group(2)

def converthtml(s):
    return re.sub(r'&(#?)(.+?);', convertentity, s)

html = converthtml(html)
html = html.replace("&nbsp;", " ")  # get rid of the remnants of certain formatting (subscript, superscript, etc.)

Decode HTML entities in a Python string?

Question: Decode HTML entities in a Python string?

I’m parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn’t automatically decode for me:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string

>>> print text
&pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?


Answer 0

Python 3.4+

Use html.unescape():

import html
print(html.unescape('&pound;682m'))

FYI html.parser.HTMLParser.unescape is deprecated, and was supposed to be removed in 3.5, although it was left in by mistake. It was eventually removed in Python 3.9.


Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

>>> try:
...     # Python 2.6-2.7 
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
... 
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

Answer 1

Beautiful Soup handles entity conversion. In Beautiful Soup 3, you’ll need to specify the convertEntities argument to the BeautifulSoup constructor (see the ‘Entity Conversion’ section of the archived docs). In Beautiful Soup 4, entities get decoded automatically.

Beautiful Soup 3

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>", 
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>

Beautiful Soup 4

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>")
<html><body><p>£682m</p></body></html>

Answer 2

You can use replace_entities from the w3lib.html library:

In [202]: from w3lib.html import replace_entities

In [203]: replace_entities("&pound;682m")
Out[203]: u'\xa3682m'

In [204]: print replace_entities("&pound;682m")
£682m

Answer 3

Beautiful Soup 4 allows you to set a formatter for your output.

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:

print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>

link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>

Answer 4

I had a similar encoding issue. I used the normalize() method. I was getting a Unicode error using the pandas .to_html() method when exporting my data frame to an .html file in another directory. I ended up doing this and it worked…

    import unicodedata 

The dataframe object can be whatever you like, let’s call it table…

    table = pd.DataFrame(data,columns=['Name','Team','OVR / POT'])
    table.index+= 1

Encode the table data so that we can export it to our .html file in the templates folder (this can be whatever location you wish :)):

     #this is where the magic happens
     html_data=unicodedata.normalize('NFKD',table.to_html()).encode('ascii','ignore')

Export the normalized string to the html file:

    file = open("templates/home.html","w") 

    file.write(html_data) 

    file.close() 

Reference: unicodedata documentation


Answer 5

This probably isn’t relevant here. But to eliminate these html entities from an entire document, you can do something like this: (Assume document = page and please forgive the sloppy code, but if you have ideas as to how to make it better, I’m all ears – I’m new to this).

import re
import HTMLParser

regexp = "&.+?;" 
list_of_html = re.findall(regexp, page) #finds all html entites in page
for e in list_of_html:
    h = HTMLParser.HTMLParser()
    unescaped = h.unescape(e) #finds the unescaped value of the html entity
    page = page.replace(e, unescaped) #replaces html entity with unescaped value

How to find elements by class

Question: How to find elements by class

I’m having trouble parsing HTML elements with “class” attribute using Beautifulsoup. The code looks like this

soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs: 
    if (div["class"] == "stylelistrow"):
        print div

I get an error on the same line “after” the script finishes.

File "./beautifulcoding.py", line 130, in getlanguage
  if (div["class"] == "stylelistrow"):
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 599, in __getitem__
   return self._getAttrMap()[key]
KeyError: 'class'

How do I get rid of this error?


Answer 0

You can refine your search to only find those divs with a given class using BS3:

mydivs = soup.findAll("div", {"class": "stylelistrow"})

Answer 1

From the documentation:

As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:

soup.find_all("a", class_="sister")

Which in this case would be:

soup.find_all("div", class_="stylelistrow")

It would also work for:

soup.find_all("div", class_="stylelistrowone stylelistrowtwo")

Answer 2

Update 2016: in the latest version of BeautifulSoup, the method findAll has been renamed to find_all. Link to official documentation

Hence the answer will be:

soup.find_all("html_element", class_="your_class_name")

Answer 3

Specific to BeautifulSoup 3:

soup.findAll('div',
             {'class': lambda x: x 
                       and 'stylelistrow' in x.split()
             }
            )

Will find all of these:

<div class="stylelistrow">
<div class="stylelistrow button">
<div class="button stylelistrow">

Answer 4

A straightforward way would be:

soup = BeautifulSoup(sdata)
for each_div in soup.findAll('div', {'class': 'stylelist'}):
    print each_div

Make sure you mind the casing of findAll; it's not findall.


Answer 5

How to find elements by class

I'm having trouble parsing html elements with "class" attribute using Beautifulsoup.

You can easily find by one class, but if you want to find by the intersection of two classes, it's a little more difficult.

From the documentation (emphasis added):

If you want to search for tags that match two or more CSS classes, you should use a CSS selector:

css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]

To be clear, this selects only the p tags that are both strikeout and body class.

To match any of a set of classes (the union, not the intersection), you can give a list to the class_ keyword argument (as of 4.1.2):

soup = BeautifulSoup(sdata)
class_list = ["stylelistrow"] # can add any other classes to this list.
# will find any divs with any names in class_list:
mydivs = soup.find_all('div', class_=class_list) 

Also note that findAll has been renamed from the camelCase to the more Pythonic find_all.


Answer 6

CSS selectors

Single class, first match:

soup.select_one('.stylelistrow')

List of matches:

soup.select('.stylelistrow')

Compound class (i.e. AND another class):

soup.select_one('.stylelistrow.otherclassname')
soup.select('.stylelistrow.otherclassname')

Spaces in compound class names, e.g. class = stylelistrow otherclassname, are replaced with ".". You can continue to add classes.

List of classes (OR – match whichever is present):

soup.select_one('.stylelistrow, .otherclassname')
soup.select('.stylelistrow, .otherclassname')

bs4 4.7.1+:

Specific class whose innerText contains a string:

soup.select_one('.stylelistrow:contains("some string")')
soup.select('.stylelistrow:contains("some string")')

Specific class which has a certain child element, e.g. an a tag:

soup.select_one('.stylelistrow:has(a)')
soup.select('.stylelistrow:has(a)')
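
A minimal, self-contained run of these selectors (requires bs4 4.7.1+ for :has(); the markup and class names are made up for the demo):

from bs4 import BeautifulSoup

html = '''
<div class="stylelistrow"><a href="#">link</a></div>
<div class="stylelistrow otherclassname">some string</div>
<div class="otherclassname">plain</div>
'''
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select('.stylelistrow')))                   # 2
print(len(soup.select('.stylelistrow.otherclassname')))    # 1 (AND)
print(len(soup.select('.stylelistrow, .otherclassname')))  # 3 (OR)
print(len(soup.select('.stylelistrow:has(a)')))            # 1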

Answer 7

As of BeautifulSoup 4+:

If you have a single class name, you can just pass the class name as a parameter:

mydivs = soup.find_all('div', 'class_name')

Or if you have more than one class name, just pass the list of class names as a parameter:

mydivs = soup.find_all('div', ['class1', 'class2'])

Answer 8

Try to check if the div has a class attribute first, like this:

soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs:
    # div.get() returns None instead of raising KeyError when there is no class attribute
    if div.get("class") == "stylelistrow":
        print div

Answer 9

This works for me to access the class attribute (on BeautifulSoup 4, contrary to what the documentation says). The KeyError comes from a list being returned, not a dictionary.

for hit in soup.findAll(name='span'):
    print hit.contents[1]['class']

Answer 10

The following worked for me:

a_tag = soup.find_all("div", class_='full tabpublist')

Answer 11

This worked for me:

for div in mydivs:
    try:
        clazz = div["class"]
    except KeyError:
        clazz = ""
    if (clazz == "stylelistrow"):
        print div

Answer 12

Alternatively, we can use lxml; it supports XPath and is very fast!

from lxml import html, etree 

attr = html.fromstring(html_text)  # parse the raw html
handles = attr.xpath('//div[@class="stylelistrow"]')  # xpath expression to find that specific class

for each in handles:
    print(etree.tostring(each))  # print the html as a string
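
Note that @class="stylelistrow" is an exact match on the whole attribute string, so <div class="stylelistrow button"> would not be found. A common XPath idiom for matching one class among several (a sketch using the same variables as above):

handles = attr.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " stylelistrow ")]'
)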

Answer 13

This should work:

soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs: 
    # class_ is a keyword argument; this matches divs that contain a
    # descendant with that class
    if div.find(class_="stylelistrow"):
        print div

Answer 14

Other answers did not work for me.

In other answers, findAll is being used on the soup object itself, but I needed a way to do a find by class name on objects inside a specific element extracted from the object I obtained after doing findAll.

If you are trying to do a search inside nested HTML elements to get objects by class name, try the below:

# parse html ('soup' here is the BeautifulSoup constructor,
# e.g. from bs4 import BeautifulSoup as soup)
page_soup = soup(web_page.read(), "html.parser")

# filter out items matching class name
all_songs = page_soup.findAll("li", "song_item")

# traverse through all_songs
for song in all_songs:

    # get text out of span element matching class 'song_name'
    # doing a 'find' by class name within a specific song element taken out of 'all_songs' collection
    song.find("span", "song_name").text

Points to note:

  1. I'm not explicitly defining the search to be on the 'class' attribute, as in findAll("li", {"class": "song_item"}), since it's the only attribute I'm searching on; a plain string second argument is by default matched against the class attribute.

  2. When you do a findAll or find, the resulting object is of class bs4.element.ResultSet, which is a subclass of list. Each element in it is a Tag, so you can keep calling find or findAll on the results, at any nesting depth.

  3. My BS4 version – 4.9.1, Python version – 3.8.1


Answer 15

The following should work:

soup.find('span', attrs={'class':'totalcount'})

Replace 'totalcount' with your class name and 'span' with the tag you are looking for. Also, if your class contains multiple space-separated names, just choose one and use it.

P.S. This finds the first element with the given criteria. If you want to find all elements, then replace 'find' with 'find_all'.


js-beautify – a JavaScript formatting tool

JS formatting

This little beautifier will reformat and re-indent bookmarklets, ugly JavaScript, unpack scripts packed by Dean Edwards' popular packer, and partially deobfuscate scripts processed by the npm package javascript-obfuscator.

I'm putting this front and center, because the existing owners currently have very limited time to work on this project. It is a popular and widely used project, but it desperately needs contributors who have time to commit to fixing customer-facing bugs and the underlying problems of the internal design and implementation.

If you're interested, please take a look at CONTRIBUTING.md, then fix an issue labeled "Good first issue" and submit a PR. Repeat as often as you can. Thanks!

Installation

You can install the beautifier for Node.js or Python.

Node.js JavaScript

You can install the npm package js-beautify. When installed globally, it provides an executable js-beautify script. As with the Python script, the beautified result is sent to stdout unless otherwise configured.

$ npm -g install js-beautify
$ js-beautify foo.js

You can also use js-beautify as a node library (install locally, the npm default):

$ npm install js-beautify

Node.js JavaScript (VNext)

The above installs the latest stable version. To install beta or RC versions:

$ npm install js-beautify@next

Web library

The beautifier can be added to your page as a web library.

JS Beautify is hosted on two CDN services: cdnjs and rawgit.

To pull the latest version from one of these services, include one set of the following script tags in your document:
<script src="https://cdnjs.cloudflare.com/ajax/libs/js-beautify/1.14.0/beautify.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/js-beautify/1.14.0/beautify-css.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/js-beautify/1.14.0/beautify-html.js"></script>

<script src="https://cdnjs.cloudflare.com/ajax/libs/js-beautify/1.14.0/beautify.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/js-beautify/1.14.0/beautify-css.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/js-beautify/1.14.0/beautify-html.min.js"></script>

<script src="https://cdn.rawgit.com/beautify-web/js-beautify/v1.14.0/js/lib/beautify.js"></script>
<script src="https://cdn.rawgit.com/beautify-web/js-beautify/v1.14.0/js/lib/beautify-css.js"></script>
<script src="https://cdn.rawgit.com/beautify-web/js-beautify/v1.14.0/js/lib/beautify-html.js"></script>

Older versions are available by changing the version number.

Disclaimer: these are free services, so there are no uptime or support guarantees.

Python

To install the Python version of the beautifier:

$ pip install jsbeautifier

Unlike the JavaScript version, the Python version can only reformat JavaScript. It does not work against HTML or CSS files, but you can install css-beautify for CSS:

$ pip install cssbeautifier

Usage

You can beautify JavaScript using JS Beautify in your web browser, or on the command line using Node.js or Python.

Web browser

Open beautifier.io. Options are available via the UI.

Web library

The script tags above expose three functions: js_beautify, css_beautify, and html_beautify.

Node.js JavaScript

When installed globally, the beautifier provides an executable js-beautify script. The beautified result is sent to stdout unless otherwise configured.

$ js-beautify foo.js

To use js-beautify as a node library (after installing it locally), import and call the appropriate beautifier method for JavaScript (JS), CSS, or HTML. All three method signatures are beautify(code, options). code is the string of code to be beautified. options is an object with the settings you would like used to beautify the code.

The configuration option names are the same as the CLI names, but with underscores instead of dashes. For example, --indent-size 2 --space-in-empty-paren would be { indent_size: 2, space_in_empty_paren: true }.

var beautify = require('js-beautify').js,
    fs = require('fs');

fs.readFile('foo.js', 'utf8', function (err, data) {
    if (err) {
        throw err;
    }
    console.log(beautify(data, { indent_size: 2, space_in_empty_paren: true }));
});

Python

After installing, to beautify using Python:

$ js-beautify file.js

Beautified output goes to stdout by default.

Using jsbeautifier as a library is simple:

import jsbeautifier
res = jsbeautifier.beautify('your javascript string')
res = jsbeautifier.beautify_file('some_file.js')

Or, to specify some options:

opts = jsbeautifier.default_options()
opts.indent_size = 2
opts.space_in_empty_paren = True
res = jsbeautifier.beautify('some javascript', opts)

The configuration option names are the same as the CLI names, but with underscores instead of dashes. The example above would be set on the command line as --indent-size 2 --space-in-empty-paren.

Options

These are the command-line flags for both the Python and JS scripts:

CLI Options:
  -f, --file       Input file(s) (Pass '-' for stdin)
  -r, --replace    Write output in-place, replacing input
  -o, --outfile    Write output to file (default stdout)
  --config         Path to config file
  --type           [js|css|html] ["js"] Select beautifier type (NOTE: Does *not* filter files, only defines which beautifier type to run)
  -q, --quiet      Suppress logging to stdout
  -h, --help       Show this help
  -v, --version    Show the version

Beautifier Options:
  -s, --indent-size                 Indentation size [4]
  -c, --indent-char                 Indentation character [" "]
  -t, --indent-with-tabs            Indent with tabs, overrides -s and -c
  -e, --eol                         Character(s) to use as line terminators.
                                    [first newline in file, otherwise "\n"]
  -n, --end-with-newline            End output with newline
  --editorconfig                    Use EditorConfig to set up the options
  -l, --indent-level                Initial indentation level [0]
  -p, --preserve-newlines           Preserve line-breaks (--no-preserve-newlines disables)
  -m, --max-preserve-newlines       Number of line-breaks to be preserved in one chunk [10]
  -P, --space-in-paren              Add padding spaces within paren, ie. f( a, b )
  -E, --space-in-empty-paren        Add a single space inside empty paren, ie. f( )
  -j, --jslint-happy                Enable jslint-stricter mode
  -a, --space-after-anon-function   Add a space before an anonymous function's parens, ie. function ()
  --space-after-named-function      Add a space before a named function's parens, i.e. function example ()
  -b, --brace-style                 [collapse|expand|end-expand|none][,preserve-inline] [collapse,preserve-inline]
  -u, --unindent-chained-methods    Don't indent chained method calls
  -B, --break-chained-methods       Break chained method calls across subsequent lines
  -k, --keep-array-indentation      Preserve array indentation
  -x, --unescape-strings            Decode printable characters encoded in xNN notation
  -w, --wrap-line-length            Wrap lines that exceed N characters [0]
  -X, --e4x                         Pass E4X xml literals through untouched
  --good-stuff                      Warm the cockles of Crockford's heart
  -C, --comma-first                 Put commas at the beginning of new line instead of end
  -O, --operator-position           Set operator position (before-newline|after-newline|preserve-newline) [before-newline]
  --indent-empty-lines              Keep indentation on empty lines
  --templating                      List of templating languages (auto,django,erb,handlebars,php,smarty) ["auto"] auto = none in JavaScript, all in html

They correspond to the underscored option keys for both library interfaces.

The defaults per CLI option:

{
    "indent_size": 4,
    "indent_char": " ",
    "indent_with_tabs": false,
    "editorconfig": false,
    "eol": "\n",
    "end_with_newline": false,
    "indent_level": 0,
    "preserve_newlines": true,
    "max_preserve_newlines": 10,
    "space_in_paren": false,
    "space_in_empty_paren": false,
    "jslint_happy": false,
    "space_after_anon_function": false,
    "space_after_named_function": false,
    "brace_style": "collapse",
    "unindent_chained_methods": false,
    "break_chained_methods": false,
    "keep_array_indentation": false,
    "unescape_strings": false,
    "wrap_line_length": 0,
    "e4x": false,
    "comma_first": false,
    "operator_position": "before-newline",
    "indent_empty_lines": false,
    "templating": ["auto"]
}

Defaults not exposed in the CLI:

{
  "eval_code": false,
  "space_before_conditional": true
}

Notice that not all defaults are exposed via the CLI. Historically, the Python and JS APIs have not been 100% identical. There are still a few other additional cases keeping us from 100% API compatibility.

Loading settings from the environment or .jsbeautifyrc (JavaScript-only)

In addition to CLI arguments, you may pass configuration to the JS executable via:

  • any jsbeautify_-prefixed environment variable
  • a JSON-formatted file indicated by the --config parameter
  • a .jsbeautifyrc file containing JSON data, at any level of the filesystem above $PWD

Configuration sources provided earlier in this stack override later ones.

Setting inheritance and language-specific overrides

The settings are a shallow tree whose values are inherited for all languages, but can be overridden. This works for settings passed directly to the API in either implementation. In the JavaScript implementation, settings loaded from a config file, such as .jsbeautifyrc, can also use inheritance/overriding.

Below is an example configuration tree showing all the supported locations for language override nodes. We'll use indent_size to discuss how this configuration behaves, but any number of settings can be inherited or overridden:

{
    "indent_size": 4,
    "html": {
        "end_with_newline": true,
        "js": {
            "indent_size": 2
        },
        "css": {
            "indent_size": 2
        }
    },
    "css": {
        "indent_size": 1
    },
    "js": {
       "preserve-newlines": true
    }
}

Using the above example configuration will result in the following:

  • HTML files
    • Inherit the top-level indent_size of 4 spaces
    • Files will also end with a newline
    • JavaScript and CSS inside HTML
      • Inherit the HTML end_with_newline setting
      • Override their indentation to 2 spaces
  • CSS files
    • Override the top-level setting to an indent_size of 1 space
  • JavaScript files
    • Inherit the top-level indent_size of 4 spaces
    • Set preserve-newlines to true

CSS & HTML

In addition to the js-beautify executable, css-beautify and html-beautify are also provided as easy interfaces into those scripts. Alternatively, js-beautify --css or js-beautify --html will accomplish the same thing, respectively.

// Programmatic access
var beautify_js = require('js-beautify'); // also available under "js" export
var beautify_css = require('js-beautify').css;
var beautify_html = require('js-beautify').html;

// All methods accept two arguments, the string to be beautified, and an options object.

The CSS & HTML beautifiers are much simpler in scope and possess far fewer options.

CSS Beautifier Options:
  -s, --indent-size                  Indentation size [4]
  -c, --indent-char                  Indentation character [" "]
  -t, --indent-with-tabs             Indent with tabs, overrides -s and -c
  -e, --eol                          Character(s) to use as line terminators. (default newline - "\\n")
  -n, --end-with-newline             End output with newline
  -b, --brace-style                  [collapse|expand] ["collapse"]
  -L, --selector-separator-newline   Add a newline between multiple selectors
  -N, --newline-between-rules        Add a newline between CSS rules
  --indent-empty-lines               Keep indentation on empty lines

HTML Beautifier Options:
  -s, --indent-size                  Indentation size [4]
  -c, --indent-char                  Indentation character [" "]
  -t, --indent-with-tabs             Indent with tabs, overrides -s and -c
  -e, --eol                          Character(s) to use as line terminators. (default newline - "\\n")
  -n, --end-with-newline             End output with newline
  -p, --preserve-newlines            Preserve existing line-breaks (--no-preserve-newlines disables)
  -m, --max-preserve-newlines        Maximum number of line-breaks to be preserved in one chunk [10]
  -I, --indent-inner-html            Indent <head> and <body> sections. Default is false.
  -b, --brace-style                  [collapse-preserve-inline|collapse|expand|end-expand|none] ["collapse"]
  -S, --indent-scripts               [keep|separate|normal] ["normal"]
  -w, --wrap-line-length             Maximum characters per line (0 disables) [250]
  -A, --wrap-attributes              Wrap attributes to new lines [auto|force|force-aligned|force-expand-multiline|aligned-multiple|preserve|preserve-aligned] ["auto"]
  -i, --wrap-attributes-indent-size  Indent wrapped attributes to after N characters [indent-size] (ignored if wrap-attributes is "aligned")
  -d, --inline                       List of tags to be considered inline tags
  -U, --unformatted                  List of tags (defaults to inline) that should not be reformatted
  -T, --content_unformatted          List of tags (defaults to pre) whose content should not be reformatted
  -E, --extra_liners                 List of tags (defaults to [head,body,/html]) that should have an extra newline before them.
  --editorconfig                     Use EditorConfig to set up the options
  --indent_scripts                   Sets indent level inside script tags ("normal", "keep", "separate")
  --unformatted_content_delimiter    Keep text content together between this string [""]
  --indent-empty-lines               Keep indentation on empty lines
  --templating                       List of templating languages (auto,none,django,erb,handlebars,php,smarty) ["auto"] auto = none in JavaScript, all in html

Directives

Directives let you control the behavior of the beautifier from within your source files. Directives are placed in comments inside the file. In CSS and JavaScript, directives are in the format /* beautify {name}:{value} */. In HTML they are formatted as <!-- beautify {name}:{value} -->.

Ignore directive

The ignore directive makes the beautifier completely ignore part of a file, treating it as literal text that is not parsed. The input below will remain unchanged after beautification:

// Use ignore when the content is not parsable in the current language, JavaScript in this case.
var a = 1;
/* beautify ignore:start */
 {This is some strange{template language{using open-braces?
/* beautify ignore:end */

Preserve directive

NOTE: this directive only works in HTML and JavaScript, not CSS.

The preserve directive makes the beautifier parse and then keep the existing formatting of a section of code.

The input below will remain unchanged after beautification:

// Use preserve when the content is valid syntax in the current language, JavaScript in this case.
// This will parse the code and preserve the existing formatting.
/* beautify preserve:start */
{
    browserName: 'internet explorer',
    platform:    'Windows 7',
    version:     '8'
}
/* beautify preserve:end */

License

You are free to use this in any way you want, in case you find it useful or working for you, but you must keep the copyright notice and license. (MIT)

Credits

Thanks also to Jason Diamond, Patrick Hof, Nochum Sossonko, Andreas Schneider, Dave Vasilevsky, Vital Batmanov, Ron Baldwin, Gabriel Harrison, Chris J. Shull, Mathias Bynens, Vittorio Gambaletta and others.

(Readme.md: js-beautify@1.14.0)

Pandas in practice: converting Excel to HTML

When it comes to exporting data with Pandas, the to_xxx family of functions comes to mind.

The most commonly used of these are pd.to_csv() and pd.to_excel(). But a DataFrame can also be exported as an HTML page, and the function for that is pd.to_html().

Reading the Excel file

To convert Excel to HTML, we first need to read the tabular data from the Excel file.

import pandas as pd
data = pd.read_excel('测试.xlsx')

Inspect the data:

data.head()

Next, let's look at how to turn the DataFrame into an HTML table.

Generating the HTML

The to_html() function converts a DataFrame directly into an HTML table, and a single line of code is all it takes:

html_table = data.to_html('测试.html')

After running the code above, a 测试.html file appears in the working directory; open it with a web browser to see the rendered table.

print(data.to_html())

Printing the result shows that the DataFrame's internal structure has been automatically converted into <th>, <tr>, and <td> tags nested in a table, preserving the full internal hierarchy.

Adjusting the format

We can also customize the parameters to adjust the format of the generated HTML:

html_table = data.to_html('测试.html', header=True, index=False, justify='center')

Open the newly generated 测试.html file again and the format has changed.

Further tweaks (adding a title, changing colors, and so on) require some HTML knowledge; you can edit the text of the generated 测试.html file directly.

If you need to present the table as a web page, pair it with the Flask library, as in the sketch below.
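
A minimal Flask sketch (hypothetical app; it assumes Flask is installed and 测试.xlsx sits in the working directory):

from flask import Flask
import pandas as pd

app = Flask(__name__)

@app.route("/")
def show_table():
    data = pd.read_excel('测试.xlsx')
    # with no path argument, to_html() returns the HTML as a string
    return data.to_html(header=True, index=False, justify='center')

if __name__ == "__main__":
    app.run(debug=True)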

Summary

Pandas provides two functions, read_html() and to_html(), for reading and writing HTML. Both are extremely handy: one effortlessly turns a complex structure like a DataFrame into an HTML table; the other grabs tabular <table> data in a few lines of code, no elaborate scraper required (see the sketch below).
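
A quick read_html() sketch (the URL is hypothetical; read_html returns one DataFrame per <table> it finds on the page):

import pandas as pd

tables = pd.read_html("https://example.com/page-with-tables.html")
first_table = tables[0]  # pick the table you need from the list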

This was a short piece, mainly covering the to_html() function in Pandas. Its biggest advantage: you can generate a tabular HTML page without knowing any HTML. This article is reproduced from 快学Python.
