Tag Archives: parsing

Extracting an attribute value with BeautifulSoup

Question: Extracting an attribute value with BeautifulSoup

I am trying to extract the content of a single “value” attribute in a specific “input” tag on a webpage. I use the following code:

import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()

from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)

inputTag = soup.findAll(attrs={"name" : "stainfo"})

output = inputTag['value']

print str(output)

I get a TypeError: list indices must be integers, not str

even though from the BeautifulSoup documentation I understand that strings should not be a problem here… but I am no specialist and I may have misunderstood.

Any suggestion is greatly appreciated! Thanks in advance.


Answer 0

.find_all() returns a list of all found elements, so:

input_tag = soup.find_all(attrs={"name" : "stainfo"})

input_tag is a list (probably containing only one element). Depending on what you want exactly you either should do:

output = input_tag[0]['value']

or use .find() method which returns only one (first) found element:

input_tag = soup.find(attrs={"name": "stainfo"})
output = input_tag['value']
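
For illustration only (my own sketch, not part of the original answer), here is the same idea on a tiny made-up HTML snippet, using the modern bs4 package rather than the old BeautifulSoup 3 import from the question:

from bs4 import BeautifulSoup

html = '<form><input type="hidden" name="stainfo" value="abc123"></form>'
soup = BeautifulSoup(html, "html.parser")

# find_all() always returns a list of tags, even when only one matches
tags = soup.find_all(attrs={"name": "stainfo"})
print(tags[0]["value"])   # abc123

# find() returns the first matching tag directly (or None if nothing matches)
tag = soup.find(attrs={"name": "stainfo"})
print(tag["value"])       # abc123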

Answer 1

In Python 3.x, simply use get(attr_name) on your tag object that you get using find_all:

xmlData = None

with open('conf//test1.xml', 'r') as xmlFile:
    xmlData = xmlFile.read()

xmlDecoded = xmlData

xmlSoup = BeautifulSoup(xmlData, 'html.parser')

repElemList = xmlSoup.find_all('repeatingelement')

for repElem in repElemList:
    print("Processing repElem...")
    repElemID = repElem.get('id')
    repElemName = repElem.get('name')

    print("Attribute id = %s" % repElemID)
    print("Attribute name = %s" % repElemName)

against XML file conf//test1.xml that looks like:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
    <singleElement>
        <subElementX>XYZ</subElementX>
    </singleElement>
    <repeatingElement id="11" name="Joe"/>
    <repeatingElement id="12" name="Mary"/>
</root>

prints:

Processing repElem...
Attribute id = 11
Attribute name = Joe
Processing repElem...
Attribute id = 12
Attribute name = Mary

Answer 2

If you want to retrieve multiple values of attributes from the source above, you can use findAll and a list comprehension to get everything you need:

import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()

from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)

inputTags = soup.findAll(attrs={"name" : "stainfo"})
### You may be able to do findAll("input", attrs={"name" : "stainfo"})

output = [x["stainfo"] for x in inputTags]

print output
### This will print a list of the values.

Answer 3

I would actually suggest a time-saving way to go with this, assuming that you know what kind of tags have those attributes.

Suppose a tag xyz has that attribute, named "staininfo".

full_tag = soup.findAll("xyz")

And I want you to understand that full_tag is a list.

for each_tag in full_tag:
    staininfo_attrb_value = each_tag["staininfo"]
    print staininfo_attrb_value

Thus you can get all the attrb values of staininfo for all the tags xyz


Answer 4

You can also use this:

import requests
from bs4 import BeautifulSoup
import csv

url = "http://58.68.130.147/"
r = requests.get(url)
data = r.text

soup = BeautifulSoup(data, "html.parser")
get_details = soup.find_all("input", attrs={"name":"stainfo"})

for val in get_details:
    get_val = val["value"]
    print(get_val)

Answer 5

I am using this with Beautifulsoup 4.8.1 to get the value of all class attributes of certain elements:

from bs4 import BeautifulSoup

html = "<td class='val1'/><td col='1'/><td class='val2' />"

bsoup = BeautifulSoup(html, 'html.parser')

for td in bsoup.find_all('td'):
    if td.has_attr('class'):
        print(td['class'][0])

It's important to note that the attribute key retrieves a list even when the attribute has only a single value.
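
As a small illustration of that last point (my own sketch, not part of the original answer), a multi-valued attribute such as class always comes back as a list, even with one value:

from bs4 import BeautifulSoup

html = "<td class='val1 highlighted'/><td class='val2'/>"
bsoup = BeautifulSoup(html, 'html.parser')

for td in bsoup.find_all('td'):
    print(td['class'])   # ['val1', 'highlighted'], then ['val2']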


How do I get everything after the last slash in a URL?

Question: How do I get everything after the last slash in a URL?

How can I extract whatever follows the last slash in a URL in Python? For example, these URLs should return the following:

URL: http://www.test.com/TEST1
returns: TEST1

URL: http://www.test.com/page/TEST2
returns: TEST2

URL: http://www.test.com/page/page/12345
returns: 12345

I’ve tried urlparse, but that gives me the full path filename, such as page/page/12345.


Answer 0

You don't need anything fancy; just look at the string methods in the standard library and you can easily split your url between the 'filename' part and the rest:

url.rsplit('/', 1)

So you can get the part you’re interested in simply with:

url.rsplit('/', 1)[-1]
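
A quick illustration (my addition) of what those two expressions return, using one of the URLs from the question:

url = "http://www.test.com/page/page/12345"

print(url.rsplit('/', 1))      # ['http://www.test.com/page/page', '12345']
print(url.rsplit('/', 1)[-1])  # 12345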

Answer 1

One more (idio(ma)tic) way:

URL.split("/")[-1]

Answer 2

rsplit should be up to the task:

In [1]: 'http://www.test.com/page/TEST2'.rsplit('/', 1)[1]
Out[1]: 'TEST2'

Answer 3

You can do it like this:

import os

head, tail = os.path.split(url)

Where tail will be your file name.


Answer 4

urlparse is fine to use if you want to (say, to get rid of any query string parameters).

import urllib.parse

urls = [
    'http://www.test.com/TEST1',
    'http://www.test.com/page/TEST2',
    'http://www.test.com/page/page/12345',
    'http://www.test.com/page/page/12345?abc=123'
]

for i in urls:
    url_parts = urllib.parse.urlparse(i)
    path_parts = url_parts[2].rpartition('/')
    print('URL: {}\nreturns: {}\n'.format(i, path_parts[2]))

Output:

URL: http://www.test.com/TEST1
returns: TEST1

URL: http://www.test.com/page/TEST2
returns: TEST2

URL: http://www.test.com/page/page/12345
returns: 12345

URL: http://www.test.com/page/page/12345?abc=123
returns: 12345

Answer 5

>>> import os
>>> os.path.basename(os.path.normpath('/folderA/folderB/folderC/folderD/'))
'folderD'

Answer 6

Here’s a more general, regex way of doing this:

    re.sub(r'^.+/([^/]+)$', r'\1', url)

Answer 7

extracted_url = url[url.rfind("/")+1:];

Answer 8

partition and rpartition are also handy for this sort of thing:

url.rpartition('/')[2]

First extract the path element from the URL:

from urllib.parse import urlparse
parsed= urlparse('https://www.dummy.example/this/is/PATH?q=/a/b&r=5#asx')

and then you can extract the last segment with string functions:

parsed.path.rpartition('/')[2]

(example resulting to 'PATH')


Answer 9

Split the url and pop the last element url.split('/').pop()


Answer 10

url ='http://www.test.com/page/TEST2'.split('/')[4]
print url

Output: TEST2.


Python / JSON: Expecting property name enclosed in double quotes

Question: Python / JSON: Expecting property name enclosed in double quotes

I’ve been trying to figure out a good way to load JSON objects in Python. I send this json data:

{'http://example.org/about': {'http://purl.org/dc/terms/title': [{'type': 'literal', 'value': "Anna's Homepage"}]}}

to the backend, where it is received as a string; I then use json.loads(data) to parse it.

But each time I get the same exception:

ValueError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

I googled it, but nothing seems to work besides the solution json.loads(json.dumps(data)), which personally does not seem that efficient to me, since it accepts any kind of data, even data that is not in JSON format.

Any suggestions will be much appreciated.


Answer 0

This:

{'http://example.org/about': {'http://purl.org/dc/terms/title': [{'type': 'literal', 'value': "Anna's Homepage"}]}}

is not JSON.
This:

{"http://example.org/about": {"http://purl.org/dc/terms/title": [{"type": "literal", "value": "Anna's Homepage"}]}}

is JSON.

EDIT:
Some commenters suggested that the above is not enough.
The JSON specification (RFC 7159) states that a string begins and ends with a quotation mark, that is, ".
A single quote ' has no semantic meaning in JSON and is allowed only inside a string.
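
To see the difference in practice, here is a small sketch (my addition) showing that json.loads accepts the double-quoted form and rejects the single-quoted one:

import json

valid = '{"key": "value"}'
invalid = "{'key': 'value'}"

print(json.loads(valid))   # {'key': 'value'}

try:
    json.loads(invalid)
except ValueError as e:    # json.JSONDecodeError is a subclass of ValueError
    print(e)               # Expecting property name enclosed in double quotes: ...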


Answer 1

As JSON only allows enclosing strings with double quotes, you can manipulate the string like this:

str = str.replace("\'", "\"")

If your JSON holds escaped single quotes (\'), then you should use the following, more precise code:

import re
p = re.compile('(?<!\\\\)\'')
str = p.sub('\"', str)

This will replace all occurrences of single quotes with double quotes in the JSON string str, and in the latter case it will not replace escaped single quotes.

You can also use js-beautify which is less strict:

$ pip install jsbeautifier
$ js-beautify file.js

Answer 2

In my case, double quotes were not the problem.

A trailing comma gave me the same error message.

{'a':{'b':c,}}
           ^

To remove this comma, I wrote some simple code.

import json

with open('a.json','r') as f:
    s = f.read()
    s = s.replace('\t','')
    s = s.replace('\n','')
    s = s.replace(',}','}')
    s = s.replace(',]',']')
    data = json.loads(s)

And this worked for me.


Answer 3

Quite simply, that string is not valid JSON. As the error says, JSON documents need to use double quotes.

You need to fix the source of the data.


Answer 4

I’ve checked your JSON data

{'http://example.org/about': {'http://purl.org/dc/terms/title': [{'type': 'literal', 'value': "Anna's Homepage"}]}}

in http://jsonlint.com/ and the results were:

Error: Parse error on line 1:
{   'http://example.org/
--^
Expecting 'STRING', '}', got 'undefined'

Modifying it to the following string solves the JSON error:

{
    "http://example.org/about": {
        "http://purl.org/dc/terms/title": [{
            "type": "literal",
            "value": "Anna's Homepage"
        }]
    }
}

Answer 5

JSON strings must use double quotes. The JSON python library enforces this so you are unable to load your string. Your data needs to look like this:

{"http://example.org/about": {"http://purl.org/dc/terms/title": [{"type": "literal", "value": "Anna's Homepage"}]}}

If that's not something you can do, you could use ast.literal_eval() instead of json.loads().
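
For example, a minimal sketch (my addition, using a simplified version of the string from the question) of parsing such Python-literal-style data with ast.literal_eval():

import ast

data = "{'http://example.org/about': {'type': 'literal', 'value': \"Anna's Homepage\"}}"

# literal_eval understands Python literal syntax, including single-quoted strings
parsed = ast.literal_eval(data)
print(parsed['http://example.org/about']['value'])   # Anna's Homepage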


Answer 6

import ast
import json

inpt = {'http://example.org/about': {'http://purl.org/dc/terms/title':
                                     [{'type': 'literal', 'value': "Anna's Homepage"}]}}

json_data = ast.literal_eval(json.dumps(inpt))

print(json_data)

This will solve the problem.


Answer 7

As the error clearly says, names should be enclosed in double quotes instead of single quotes. The string you pass is simply not valid JSON. It should look like:

{"http://example.org/about": {"http://purl.org/dc/terms/title": [{"type": "literal", "value": "Anna's Homepage"}]}}

Answer 8

I used this method and managed to get the desired output. My script:

x = "{'inner-temperature': 31.73, 'outer-temperature': 28.38, 'keys-value': 0}"

x = x.replace("'", '"')
j = json.loads(x)
print(j['keys-value'])

output

>>> 0

Answer 9

import json

with open('input.json','r') as f:
    s = f.read()
    s = s.replace('\'','\"')
    data = json.loads(s)

This worked perfectly well for me. Thanks.


Answer 10

x = x.replace("'", '"')
j = json.loads(x)

Although this is the correct solution, it may lead to quite a headache if there is JSON like this:

{'status': 'success', 'data': {'equity': {'enabled': True, 'net': 66706.14510000008, 'available': {'adhoc_margin': 0, 'cash': 1277252.56, 'opening_balance': 1277252.56, 'live_balance': 66706.14510000008, 'collateral': 249823.93, 'intraday_payin': 15000}, 'utilised': {'debits': 1475370.3449, 'exposure': 607729.3129, 'm2m_realised': 0, 'm2m_unrealised': -9033, 'option_premium': 0, 'payout': 0, 'span': 858608.032, 'holding_sales': 0, 'turnover': 0, 'liquid_collateral': 0, 'stock_collateral': 249823.93}}, 'commodity': {'enabled': True, 'net': 0, 'available': {'adhoc_margin': 0, 'cash': 0, 'opening_balance': 0, 'live_balance': 0, 'collateral': 0, 'intraday_payin': 0}, 'utilised': {'debits': 0, 'exposure': 0, 'm2m_realised': 0, 'm2m_unrealised': 0, 'option_premium': 0, 'payout': 0, 'span': 0, 'holding_sales': 0, 'turnover': 0, 'liquid_collateral': 0, 'stock_collateral': 0}}}}

Notice the True value? Use the following so that Booleans are handled as well. This will cover those cases:

x = x.replace("'", '"').replace("True", '"True"').replace("False", '"False"').replace("null", '"null"')
j = json.loads(x)

Also, make sure you do not do

x = json.loads(x)

It has to be another variable.


Answer 11

I had a similar problem. Two components communicating with each other were using a queue.

The first component was not doing json.dumps before putting the message on the queue, so the JSON string seen by the receiving component was in single quotes. This was causing the error

 Expecting property name enclosed in double quotes

Adding json.dumps started creating correctly formatted JSON and solved the issue.


Answer 12

Use the eval function.

It takes care of the discrepancy between single and double quotes.
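
For illustration only (my sketch, not from the answer), and with the usual caveat that eval executes arbitrary Python expressions and should never be used on untrusted input:

data = "{'type': 'literal', 'value': \"Anna's Homepage\"}"

# eval parses the single-quoted dict literal, but it will also run any other
# expression it is given, so ast.literal_eval is the safer choice.
parsed = eval(data)
print(parsed['value'])   # Anna's Homepage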


Answer 13

As the other answers explain well, the error occurs because of invalid quote characters passed to the json module.

In my case I continued to get the ValueError even after replacing ' with " in my string. What I finally realized was that some quote-like unicode symbols had found their way into my string:

 “  ”  ‛  ’  ‘  `  ´  ″  ′ 

To clean all of these you can just pass your string through a regular expression:

import re

raw_string = '{“key”:“value”}'

parsed_string = re.sub(r"[“|”|‛|’|‘|`|´|″|′|']", '"', raw_string)

json_object = json.loads(parsed_string)


Answer 14

I have run into this problem multiple times when the JSON has been edited by hand. If someone deletes something from the file without noticing, it can throw the same error.

For instance, if the last "}" of your JSON is missing, it will throw the same error.

So if you edit your file by hand, make sure you format it as expected by the JSON decoder; otherwise you will run into the same problem.

Hope this helps!


Answer 15

It is always ideal to use the json.dumps() method. To get rid of this error, I used the following code:

json.dumps(YOUR_DICT_STRING).replace("'", '"')

How do I split and parse a string in Python?

Question: How do I split and parse a string in Python?

I am trying to split this string in python: 2.7.0_bf4fda703454

I want to split that string on the underscore _ so that I can use the value on the left side.


Answer 0

"2.7.0_bf4fda703454".split("_") gives a list of strings:

In [1]: "2.7.0_bf4fda703454".split("_")
Out[1]: ['2.7.0', 'bf4fda703454']

This splits the string at every underscore. If you want it to stop after the first split, use "2.7.0_bf4fda703454".split("_", 1).

If you know for a fact that the string contains an underscore, you can even unpack the LHS and RHS into separate variables:

In [8]: lhs, rhs = "2.7.0_bf4fda703454".split("_", 1)

In [9]: lhs
Out[9]: '2.7.0'

In [10]: rhs
Out[10]: 'bf4fda703454'

An alternative is to use partition(). The usage is similar to the last example, except that it returns three components instead of two. The principal advantage is that this method doesn’t fail if the string doesn’t contain the separator.
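
A short sketch (my addition) of that behaviour: partition() always returns a 3-tuple of (before, separator, after), even when the separator is missing, so unpacking never fails:

version = "2.7.0_bf4fda703454"
print(version.partition("_"))   # ('2.7.0', '_', 'bf4fda703454')

# With no separator present, the whole string ends up in the first slot
lhs, sep, rhs = "270bf4fda703454".partition("_")
print(lhs, repr(sep), repr(rhs))   # 270bf4fda703454 '' ''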


Answer 1

Python string parsing walkthrough

Split a string on space, get a list, show its type, print it out:

el@apollo:~/foo$ python
>>> mystring = "What does the fox say?"

>>> mylist = mystring.split(" ")

>>> print type(mylist)
<type 'list'>

>>> print mylist
['What', 'does', 'the', 'fox', 'say?']

If you have two delimiters next to each other, empty string is assumed:

el@apollo:~/foo$ python
>>> mystring = "its  so   fluffy   im gonna    DIE!!!"

>>> print mystring.split(" ")
['its', '', 'so', '', '', 'fluffy', '', '', 'im', 'gonna', '', '', '', 'DIE!!!']

Split a string on underscore and grab the 5th item in the list:

el@apollo:~/foo$ python
>>> mystring = "Time_to_fire_up_Kowalski's_Nuclear_reactor."

>>> mystring.split("_")[4]
"Kowalski's"

Collapse multiple spaces into one

el@apollo:~/foo$ python
>>> mystring = 'collapse    these       spaces'

>>> mycollapsedstring = ' '.join(mystring.split())

>>> print mycollapsedstring.split(' ')
['collapse', 'these', 'spaces']

When you pass no parameter to Python’s split method, the documentation states: “runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace”.

Hold onto your hats boys, parse on a regular expression:

el@apollo:~/foo$ python
>>> mystring = 'zzzzzzabczzzzzzdefzzzzzzzzzghizzzzzzzzzzzz'
>>> import re
>>> mylist = re.split("[a-m]+", mystring)
>>> print mylist
['zzzzzz', 'zzzzzz', 'zzzzzzzzz', 'zzzzzzzzzzzz']

The regular expression "[a-m]+" means that runs of one or more lowercase letters a through m are matched as a delimiter. re is a library to be imported.

Or if you want to chomp the items one at a time:

el@apollo:~/foo$ python
>>> mystring = "theres coffee in that nebula"

>>> mytuple = mystring.partition(" ")

>>> print type(mytuple)
<type 'tuple'>

>>> print mytuple
('theres', ' ', 'coffee in that nebula')

>>> print mytuple[0]
theres

>>> print mytuple[2]
coffee in that nebula

Answer 2

If it’s always going to be an even LHS/RHS split, you can also use the partition method that’s built into strings. It returns a 3-tuple as (LHS, separator, RHS) if the separator is found, and (original_string, '', '') if the separator wasn’t present:

>>> "2.7.0_bf4fda703454".partition('_')
('2.7.0', '_', 'bf4fda703454')

>>> "shazam".partition("_")
('shazam', '', '')

Find the last occurrence of a substring in a string, and replace it

Question: Find the last occurrence of a substring in a string, and replace it

So I have a long list of strings in the same format, and I want to find the last "." character in each one, and replace it with ". - ". I've tried using rfind, but I can't seem to utilize it properly to do this.


Answer 0

This should do it

old_string = "this is going to have a full stop. some written sstuff!"
k = old_string.rfind(".")
new_string = old_string[:k] + ". - " + old_string[k+1:]

Answer 1

To replace from the right:

def replace_right(source, target, replacement, replacements=None):
    return replacement.join(source.rsplit(target, replacements))

In use:

>>> replace_right("asd.asd.asd.", ".", ". -", 1)
'asd.asd.asd. -'

Answer 2

I would use a regex:

import re
new_list = [re.sub(r"\.(?=[^.]*$)", r". - ", s) for s in old_list]
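
A quick demonstration of that substitution on a made-up example (my addition):

import re

old_list = ["this is a sentence. with two dots.", "only one dot here."]
new_list = [re.sub(r"\.(?=[^.]*$)", ". - ", s) for s in old_list]
print(new_list)   # ['this is a sentence. with two dots. - ', 'only one dot here. - ']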

Answer 3

A one-liner would be:

str=str[::-1].replace(".",".-",1)[::-1]


Answer 4

You can use the function below, which replaces the first occurrence of the word from the right.

def replace_from_right(text: str, original_text: str, new_text: str) -> str:
    """ Replace first occurrence of original_text by new_text. """
    return text[::-1].replace(original_text[::-1], new_text[::-1], 1)[::-1]

Answer 5

a = "A long string with a . in the middle ending with ."

# If you want to find the index of the last occurrence of any string; in our case we will find the index of the last occurrence of "with"

index = a.rfind("with") 

# the result will be 44, as index starts from 0.


Answer 6

Naïve approach:

a = "A long string with a . in the middle ending with ."
fchar = '.'
rchar = '. -'
a[::-1].replace(fchar, rchar[::-1], 1)[::-1]

Out[2]: 'A long string with a . in the middle ending with . -'

Aditya Sihag’s answer with a single rfind:

pos = a.rfind('.')
a[:pos] + '. -' + a[pos+1:]

URL query parameters to dict in Python

Question: URL query parameters to dict in Python

Is there a way to parse a URL (with some python library) and return a python dictionary with the keys and values of a query parameters part of the URL?

For example:

url = "http://www.example.org/default.html?ct=32&op=92&item=98"

expected return:

{'ct':32, 'op':92, 'item':98}

Answer 0

Use the urllib.parse library:

>>> from urllib import parse
>>> url = "http://www.example.org/default.html?ct=32&op=92&item=98"
>>> parse.urlsplit(url)
SplitResult(scheme='http', netloc='www.example.org', path='/default.html', query='ct=32&op=92&item=98', fragment='')
>>> parse.parse_qs(parse.urlsplit(url).query)
{'item': ['98'], 'op': ['92'], 'ct': ['32']}
>>> dict(parse.parse_qsl(parse.urlsplit(url).query))
{'item': '98', 'op': '92', 'ct': '32'}

The urllib.parse.parse_qs() and urllib.parse.parse_qsl() methods parse out query strings, taking into account that keys can occur more than once and that order may matter.

If you are still on Python 2, urllib.parse was called urlparse.


Answer 1

For Python 3, the values of the dict from parse_qs are in a list, because there might be multiple values. If you just want the first one:

>>> from urllib.parse import urlsplit, parse_qs
>>>
>>> url = "http://www.example.org/default.html?ct=32&op=92&item=98"
>>> query = urlsplit(url).query
>>> params = parse_qs(query)
>>> params
{'item': ['98'], 'op': ['92'], 'ct': ['32']}
>>> dict(params)
{'item': ['98'], 'op': ['92'], 'ct': ['32']}
>>> {k: v[0] for k, v in params.items()}
{'item': '98', 'op': '92', 'ct': '32'}

Answer 2

If you prefer not to use a parser:

url = "http://www.example.org/default.html?ct=32&op=92&item=98"
url = url.split("?")[1]
dict = {x[0] : x[1] for x in [x.split("=") for x in url[1:].split("&") ]}

So I won't delete what's above, but it's definitely not what you should use.

I think I read a few of the answers and they looked a little complicated; in case you're like me, don't use my solution.

Use this:

from urllib import parse
params = dict(parse.parse_qsl(parse.urlsplit(url).query))

and for Python 2.X

import urlparse as parse
params = dict(parse.parse_qsl(parse.urlsplit(url).query))

I know this is the same as the accepted answer, just in a one liner that can be copied.


Answer 3

For python 2.7

In [14]: url = "http://www.example.org/default.html?ct=32&op=92&item=98"

In [15]: from urlparse import urlparse, parse_qsl

In [16]: parse_url = urlparse(url)

In [17]: query_dict = dict(parse_qsl(parse_url.query))

In [18]: query_dict
Out[18]: {'ct': '32', 'item': '98', 'op': '92'}

Answer 4

I agree about not reinventing the wheel but sometimes (while you’re learning) it helps to build a wheel in order to understand a wheel. :) So, from a purely academic perspective, I offer this with the caveat that using a dictionary assumes that name value pairs are unique (that the query string does not contain multiple records).

url = 'http:/mypage.html?one=1&two=2&three=3'

page, query = url.split('?')

names_values_dict = dict(pair.split('=') for pair in query.split('&'))

names_values_list = [pair.split('=') for pair in query.split('&')]

I’m using version 3.6.5 in the Idle IDE.


Answer 5

For Python 2.7, I am using the urlparse module to parse the url query into a dict.

import urlparse

url = "http://www.example.org/default.html?ct=32&op=92&item=98"

print urlparse.parse_qs( urlparse.urlparse(url).query )
# result: {'item': ['98'], 'op': ['92'], 'ct': ['32']} 

Antlr4 - ANTLR is a powerful parser generator for reading, processing, executing, or translating structured text or binary files

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It is widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build parse trees, and it also generates a listener interface (or visitor) that makes it easy to respond to the recognition of phrases of interest.

Given the constraints of a day job, my time to work on this project is limited, so I have to focus first on fixing bugs rather than changing or improving the feature set. Most likely I will do this in bursts every few months. Please do not be offended if your bug report or pull request does not get a response! –parrt

Authors and major contributors

Useful information

You may also find the following pages useful, particularly if you want to play around with the various target languages.

The Definitive ANTLR 4 Reference

Programmers run into parsing problems all the time. Whether it's a data format like JSON, a network protocol like SMTP, a server configuration file for Apache, a PostScript/PDF file, or a simple spreadsheet macro language, ANTLR v4 and this book will demystify the process. ANTLR v4 has been rewritten from scratch to make it easier than ever to build parsers and the language applications built on top of them. This completely rewritten new edition of the bestselling Definitive ANTLR Reference shows you how to take advantage of these new features.

You can buy the book The Definitive ANTLR 4 Reference on Amazon, or get the electronic version at the publisher's site.

You will find the Book source code useful.

Additional grammars

This repository is a collection of grammars without actions, where the root directory name is the all-lowercase name of the language parsed by the grammar. For example, java, cpp, csharp, c, etc.

How do I convert a comma-delimited string into a list in Python?

Question: How do I convert a comma-delimited string into a list in Python?

Given a string that is a sequence of several values separated by a comma:

mStr = 'A,B,C,D,E' 

How do I convert the string to a list?

mList = ['A', 'B', 'C', 'D', 'E']

Answer 0

You can use the str.split method.

>>> my_string = 'A,B,C,D,E'
>>> my_list = my_string.split(",")
>>> print my_list
['A', 'B', 'C', 'D', 'E']

If you want to convert it to a tuple, just

>>> print tuple(my_list)
('A', 'B', 'C', 'D', 'E')

If you are looking to append to a list, try this:

>>> my_list.append('F')
>>> print my_list
['A', 'B', 'C', 'D', 'E', 'F']

Answer 1

If the string includes integers and you want to avoid casting them to int individually, you can do:

mList = [int(e) if e.isdigit() else e for e in mStr.split(',')]

It is called list comprehension, and it is based on set builder notation.

ex:

>>> mStr = "1,A,B,3,4"
>>> mList = [int(e) if e.isdigit() else e for e in mStr.split(',')]
>>> mList
>>> [1,'A','B',3,4]

Answer 2

>>> some_string='A,B,C,D,E'
>>> new_tuple= tuple(some_string.split(','))
>>> new_tuple
('A', 'B', 'C', 'D', 'E')

Answer 3

You can use this function to convert a comma-delimited string of single characters to a list:

def stringtolist(x):
    mylist=[]
    for i in range(0,len(x),2):
        mylist.append(x[i])
    return mylist

Answer 4

#splits string according to delimeters 
'''
Let's make a function that can split a string
into list according the given delimeters. 
example data: cat;dog:greff,snake/
example delimeters: ,;- /|:
'''
def string_to_splitted_array(data,delimeters):
    #result list
    res = []
    # we will add chars into sub_str until
    # reach a delimeter
    sub_str = ''
    for c in data: #iterate over data char by char
        # if we reached a delimeter, we store the result 
        if c in delimeters: 
            # avoid empty strings
            if len(sub_str)>0:
                # looks like a valid string.
                res.append(sub_str)
                # reset sub_str to start over
                sub_str = ''
        else:
            # c is not a deilmeter. then it is 
            # part of the string.
            sub_str += c
    # there may not be delimeter at end of data. 
    # if sub_str is not empty, we should att it to list. 
    if len(sub_str)>0:
        res.append(sub_str)
    # result is in res 
    return res

# test the function. 
delimeters = ',;- /|:'
# read the csv data from console. 
csv_string = input('csv string:')
#lets check if working. 
splitted_array = string_to_splitted_array(csv_string,delimeters)
print(splitted_array)

Answer 5

Consider the following in order to handle the case of an empty string:

>>> my_string = 'A,B,C,D,E'
>>> my_string.split(",") if my_string else []
['A', 'B', 'C', 'D', 'E']
>>> my_string = ""
>>> my_string.split(",") if my_string else []
[]

Answer 6

You can split that string on , and directly get a list:

mStr = 'A,B,C,D,E'
list1 = mStr.split(',')
print(list1)

Output:

['A', 'B', 'C', 'D', 'E']

You can also convert it to a tuple:

print(tuple(list1))

Output:

('A', 'B', 'C', 'D', 'E')


Retrieving parameters from a URL

Question: Retrieving parameters from a URL

Given a URL like the following, how can I parse the value of the query parameters? For example, in this case I want the value of def.

/abc?def='ghi'

I am using Django in my environment; is there a method on the request object that could help me?

I tried using self.request.get('def') but it is not returning the value ghi as I had hoped.


Answer 0

Python 2:

import urlparse
url = 'http://foo.appspot.com/abc?def=ghi'
parsed = urlparse.urlparse(url)
print urlparse.parse_qs(parsed.query)['def']

Python 3:

import urllib.parse as urlparse
from urllib.parse import parse_qs
url = 'http://foo.appspot.com/abc?def=ghi'
parsed = urlparse.urlparse(url)
print(parse_qs(parsed.query)['def'])

parse_qs returns a list of values, so the above code will print ['ghi'].

Here’s the Python 3 documentation.


Answer 1

I’m shocked this solution isn’t on here already. Use:

request.GET.get('variable_name')

This will “get” the variable from the “GET” dictionary, and return the ‘variable_name’ value if it exists, or a None object if it doesn’t exist.
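
As a hedged illustration only (my own sketch, not from the answer; the view and parameter names are made up), this is roughly how it might look inside a Django view:

from django.http import HttpResponse

def my_view(request):
    # For a request like /abc?def=ghi this returns 'ghi';
    # the second argument is the fallback used when the parameter is absent.
    value = request.GET.get('def', None)
    return HttpResponse(value or 'parameter "def" not supplied')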


Answer 2

import urlparse
url = 'http://example.com/?q=abc&p=123'
par = urlparse.parse_qs(urlparse.urlparse(url).query)

print par['q'][0], par['p'][0]

Answer 3

For Python > 3.4:

from urllib import parse
url = 'http://foo.appspot.com/abc?def=ghi'
query_def=parse.parse_qs(parse.urlparse(url).query)['def'][0]

Answer 4

There is a new library called furl. I find this library to be most pythonic for doing url algebra. To install:

pip install furl

Code:

from furl import furl
f = furl("/abc?def='ghi'") 
print f.args['def']

Answer 5

I know this is a bit late but since I found myself on here today, I thought that this might be a useful answer for others.

import urlparse
url = 'http://example.com/?q=abc&p=123'
parsed = urlparse.urlparse(url)
params = urlparse.parse_qsl(parsed.query)
for x,y in params:
    print "Parameter = "+x,"Value = "+y

With parse_qsl(), “Data are returned as a list of name, value pairs.”


Answer 6

The url you are referring is a query type and I see that the request object supports a method called arguments to get the query arguments. You may also want try self.request.get('def') directly to get your value from the object..


Answer 7

def getParams(url):
    params = url.split("?")[1]
    params = params.split('=')
    pairs = zip(params[0::2], params[1::2])
    answer = dict((k,v) for k,v in pairs)
    return answer

Hope this helps


Answer 8

The urlparse module provides everything you need:

urlparse.parse_qs()
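
A minimal sketch (my addition, in Python 2 style to match the urlparse module named in the answer):

import urlparse

url = 'http://foo.appspot.com/abc?def=ghi'
params = urlparse.parse_qs(urlparse.urlparse(url).query)
# params is now {'def': ['ghi']}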


Answer 9

There's no need to do any of that. Just use

self.request.get('variable_name')

Notice that I'm not specifying the method (GET, POST, etc.). This is well documented, and here is an example.

The fact that you use Django templates doesn't mean the handler is processed by Django as well.


Answer 10

In pure Python:

def get_param_from_url(url, param_name):
    return [i.split("=")[-1] for i in url.split("?", 1)[-1].split("&") if i.startswith(param_name + "=")][0]

Answer 11

import cgitb
cgitb.enable()

import cgi
print "Content-Type: text/plain;charset=utf-8"
print
form = cgi.FieldStorage()
i = int(form.getvalue('a'))+int(form.getvalue('b'))
print i

Answer 12

Btw, I was having issues using parse_qs() and getting empty-value parameters, and learned that you have to pass a second optional parameter 'keep_blank_values' to return a list of the parameters in a query string that contain no values. It defaults to false. Some badly written APIs require parameters to be present even if they contain no values:

for k,v in urlparse.parse_qs(p.query, True).items():
  print k

Answer 13

There is a nice library w3lib.url

from w3lib.url import url_query_parameter
url = "/abc?def=ghi"
print url_query_parameter(url, 'def')
ghi

Answer 14

parameters = dict([part.split('=') for part in get_parsed_url[4].split('&')])

This one is simple. The variable parameters will contain a dictionary of all the parameters.


Answer 15

I see there isn’t an answer for users of Tornado:

key = self.request.query_arguments.get("key", None)

This method works inside a handler that is derived from:

tornado.web.RequestHandler

None is the answer this method will return when the requested key can’t be found. This saves you some exception handling.


When splitting an empty string in Python, why does split() return an empty list while split('\n') returns ['']?

Question: When splitting an empty string in Python, why does split() return an empty list while split('\n') returns ['']?

I am using split('\n') to get lines in one string, and found that ''.split() returns an empty list, [], while ''.split('\n') returns ['']. Is there any specific reason for such a difference?

And is there any more convenient way to count lines in a string?


Answer 0

Question: I am using split('\n') to get lines in one string, and found that ''.split() returns an empty list, [], while ''.split('\n') returns [''].

The str.split() method has two algorithms. If no arguments are given, it splits on repeated runs of whitespace. However, if an argument is given, it is treated as a single delimiter with no repeated runs.

In the case of splitting an empty string, the first mode (no argument) will return an empty list because the whitespace is eaten and there are no values to put in the result list.

In contrast, the second mode (with an argument such as \n) will produce the first empty field. Consider if you had written '\n'.split('\n'), you would get two fields (one split, gives you two halves).

Question: Is there any specific reason for such a difference?

This first mode is useful when data is aligned in columns with variable amounts of whitespace. For example:

>>> data = '''\
Shasta      California     14,200
McKinley    Alaska         20,300
Fuji        Japan          12,400
'''
>>> for line in data.splitlines():
        print line.split()

['Shasta', 'California', '14,200']
['McKinley', 'Alaska', '20,300']
['Fuji', 'Japan', '12,400']

The second mode is useful for delimited data such as CSV where repeated commas denote empty fields. For example:

>>> data = '''\
Guido,BDFL,,Amsterdam
Barry,FLUFL,,USA
Tim,,,USA
'''
>>> for line in data.splitlines():
        print line.split(',')

['Guido', 'BDFL', '', 'Amsterdam']
['Barry', 'FLUFL', '', 'USA']
['Tim', '', '', 'USA']

Note, the number of result fields is one greater than the number of delimiters. Think of cutting a rope. If you make no cuts, you have one piece. Making one cut, gives two pieces. Making two cuts, gives three pieces. And so it is with Python’s str.split(delimiter) method:

>>> ''.split(',')       # No cuts
['']
>>> ','.split(',')      # One cut
['', '']
>>> ',,'.split(',')     # Two cuts
['', '', '']

Question: And is there any more convenient way to count lines in a string?

Yes, there are a couple of easy ways. One uses str.count() and the other uses str.splitlines(). Both ways will give the same answer unless the final line is missing the \n. If the final newline is missing, the str.splitlines approach will give the accurate answer. A faster technique that is also accurate uses the count method but then corrects it for the final newline:

>>> data = '''\
Line 1
Line 2
Line 3
Line 4'''

>>> data.count('\n')                               # Inaccurate
3
>>> len(data.splitlines())                         # Accurate, but slow
4
>>> data.count('\n') + (not data.endswith('\n'))   # Accurate and fast
4    

Question from @Kaz: Why the heck are two very different algorithms shoe-horned into a single function?

The signature for str.split is about 20 years old, and a number of the APIs from that era are strictly pragmatic. While not perfect, the method signature isn’t “terrible” either. For the most part, Guido’s API design choices have stood the test of time.

The current API is not without advantages. Consider strings such as:

ps_aux_header  = "USER               PID  %CPU %MEM      VSZ"
patient_header = "name,age,height,weight"

When asked to break these strings into fields, people tend to describe both using the same English word, “split”. When asked to read code such as fields = line.split() or fields = line.split(','), people tend to correctly interpret the statements as “splits a line into fields”.

Microsoft Excel’s text-to-columns tool made a similar API choice and incorporates both splitting algorithms in the same tool. People seem to mentally model field-splitting as a single concept even though more than one algorithm is involved.


Answer 1

It seems to simply be the way it’s supposed to work, according to the documentation:

Splitting an empty string with a specified separator returns [''].

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].

So, to make it clearer, the split() function implements two different splitting algorithms, and uses the presence of an argument to decide which one to run. This might be because it allows optimizing the one for no arguments more than the one with arguments; I don’t know.


Answer 2

.split() without parameters tries to be clever. It splits on any whitespace, tabs, spaces, line feeds etc, and it also skips all empty strings as a result of this.

>>> "  fii    fbar \n bopp ".split()
['fii', 'fbar', 'bopp']

Essentially, .split() without parameters is used to extract words from a string, as opposed to .split() with parameters, which just takes a string and splits it.

That’s the reason for the difference.

And yeah, counting lines by splitting is not an efficient way. Count the number of line feeds, and add one if the string doesn’t end with a line feed.
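
A small sketch (my addition) of that suggestion:

def count_lines(s):
    # Count line feeds, then add one for a final line without a trailing '\n'
    if not s:
        return 0
    return s.count('\n') + (not s.endswith('\n'))

print(count_lines("Line 1\nLine 2\nLine 3"))    # 3
print(count_lines("Line 1\nLine 2\nLine 3\n"))  # 3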


Answer 3

Use count():

s = "Line 1\nLine2\nLine3"
n_lines = s.count('\n') + 1

Answer 4

>>> print str.split.__doc__
S.split([sep [,maxsplit]]) -> list of strings

Return a list of the words in the string S, using sep as the
delimiter string.  If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are removed
from the result.

Note the last sentence.

To count lines you can simply count how many \n are there:

line_count = some_string.count('\n') + (some_string[-1] != '\n')

The last part takes into account a final line that does not end with \n, even though this means that Hello, World! and Hello, World!\n have the same line count (which for me is reasonable); otherwise you could simply add 1 to the count of \n.


Answer 5

To count lines, you can count the number of line breaks:

n_lines = sum(1 for s in the_string if s == "\n") + 1 # add 1 for last line

Edit:

The other answer using the built-in count is more suitable, actually.