-1

I have a string that I need to split at two separate points, but all I can find is how to split a string on delimiters like "," and other punctuation.

import re

string = '<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>'

hyperlink = re.split(r'(?=https)',string)

print(hyperlink[0])

In the example above, I need to extract just the URL "https://google.com" from the string and print it. However, I can only find out how to split the string at "https", so everything after the URL comes with it.

I hope this makes sense. After a bunch of searching and testing I can't figure out how to do this.

2
  • 1
    Did you consider using an HTML parser for parsing HTML data? docs.python.org/3/library/html.parser.html exists. Commented Mar 3 at 16:15
  • 1
    Wait, you edited your question, and it became a chameleon question. I did not notice. Please ask a separate question for your new question. See meta.stackoverflow.com/questions/266767/… . Kindly restore your question to its state before the edit, since the answer below is already accepted, and ask a new question. Commented Mar 3 at 16:22

4 Answers

2

There are many ways to achieve this, but a simple one is to use find() and then slice. find() returns the starting position of a substring within a string, and you can use that position to slice. For example:

string = '<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>'

# Find where the URL starts
start_word = "https"
start_index = string.find(start_word)

# For URLs, we need to find where it ends - usually at a quote mark
end_index = string.find('"', start_index)

# Extract just the URL
result = string[start_index:end_index]

print(result)

Output:

https://google.com

The find() method returns the index where the substring begins. Then, using these positions, we slice the string to extract just the section we want.
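One caveat worth noting: find() returns -1 when the substring is not present, and slicing with -1 would silently return the wrong text. Below is a minimal sketch of a guarded version. The helper name extract_between is hypothetical and not part of the original answer.

```python
def extract_between(text, start_word, end_char):
    """Return the text from start_word up to (not including) end_char, or None."""
    start_index = text.find(start_word)
    if start_index == -1:
        return None  # start marker is not present
    end_index = text.find(end_char, start_index)
    if end_index == -1:
        return None  # no terminator found after the start marker
    return text[start_index:end_index]


string = '<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>'

print(extract_between(string, "https", '"'))          # the URL
print(extract_between("no link here", "https", '"'))  # None
```

Returning None keeps the "not found" case explicit, so the caller can decide what to do instead of printing a mangled slice.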


2 Comments

Thanks @EuanG. I will give this a try. I edited my initial question to make it specific to my situation instead of hypothetical. I also added what I tried. Thanks again!
@JustinBertsch modified to follow new example
1

There are various regular expressions and functions from the re module that will achieve your objective.

Here's one:

import re

string = '<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>'

m = re.findall(r'^.*href="(.*)"\s.*$', string)

print(*m)

Output:

https://google.com
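The pattern above depends on a greedy `.*` and on the closing quote being followed by whitespace, so it may capture too much when several quoted attributes follow href, or match nothing when href is the last attribute in the tag. A possibly more tolerant variant, offered only as a sketch, captures everything up to the next double quote:

```python
import re

string = '<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>'

# [^"]* matches a run of characters that are not a double quote,
# so the capture group stops at the first closing quote after href="
match = re.search(r'href="([^"]*)"', string)

# re.search returns None when nothing matches, so guard before using the group
url = match.group(1) if match else None
print(url)
```

This still assumes double-quoted attribute values; for arbitrary HTML, a real parser is the safer choice.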

If you prefer not to use re then:

kw = 'href="'
start = string.find(kw) + len(kw)
end = string[start:].find('"')
result = string[start : end + start]
print(result)

...will give the same output.


0

As someone suggested in a comment, you could also use a module for parsing XML or HTML,
like lxml or BeautifulSoup; sometimes that is the simpler method.

from bs4 import BeautifulSoup

html = '<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>'

soup = BeautifulSoup(html, 'html.parser')

hyperlink = soup.find('a').attrs['href']
print(hyperlink)

#target = soup.find('a').attrs['target']
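If installing a third-party package is not an option, the standard library's html.parser module (the one mentioned in the comments on the question) can likely do the same job. Here is a rough sketch. The class name LinkCollector is a hypothetical helper, not something the module provides.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


html = '<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>'

collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

Collecting into a list means the same code handles a string with several links, at the cost of indexing into the list when only one is expected.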


0

The parser expressions that pyparsing creates to match HTML tags avoid many of the classical issues with using tools like regex to parse HTML:

  • handles case insensitivity (of tags and tag attribute names)

  • handles quoted and unquoted attribute values

  • detects closed tags (opening tags that end with '/')

  • ignores embedded whitespace

In this case, we just need to search for an <a> tag, and let pyparsing grab the tag attributes, as attributes on the parsed result:

string = """<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>"""

import pyparsing as pp

# make_html_tags returns a pair of parser expressions, one for the opening tag 
# and one for the matching closing tag - we just need the opening tag
a_tag, _ = pp.make_html_tags("a")

# search_string will return a sequence of all matches, like re.findall
anchor = a_tag.search_string(string)[0]
print(anchor.href)
# https://google.com

print(anchor.target)
# something

