Python: using split() to split a string at 2 separate points

Question

I have a string that I need to split at 2 separate parts, but all I find is how to split the string using identifiers like "," and other punctuation.

import re

string = '<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>'

hyperlink = re.split(r'(?=https)',string)

print(hyperlink[0])

In the example above, I need to extract just the URL in the string, "https://google.com", and print it. However, I can only find out how to split the string at "https", so everything past the URL comes with it.

I hope this makes sense. After a bunch of searching and testing I can't figure out how to do this.

Did you consider using an HTML parser for parsing HTML data? docs.python.org/3/library/html.parser.html exists.
– KamilCuk, Commented Mar 3 at 16:15
Wait, you edited your question, and it became a chameleon. I did not notice. Please ask a separate question for your new question. See meta.stackoverflow.com/questions/266767/… . Kindly restore your question to its state before the edit, since the answer below is already accepted, and ask a new question.
– KamilCuk, Commented Mar 3 at 16:22

EuanG · Accepted Answer · 2025-03-03 16:23:43Z

2

There are many ways to achieve this, but a simple one is to use find() and then slice. find() returns the starting index of a substring within a string, and you can use that index as a slice boundary. For example:

string = '<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>'

# Find where the URL starts
start_word = "https"
start_index = string.find(start_word)

# For URLs, we need to find where it ends - usually at a quote mark
end_index = string.find('"', start_index)

# Extract just the URL
result = string[start_index:end_index]

print(result)

Output:

https://google.com

The find() method returns the index where the substring begins. Then, using these positions, we slice the string to extract just the section we want.
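
One caveat: find() returns -1 when the substring is not present, and slicing with -1 quietly produces the wrong text instead of raising an error. A minimal sketch of a guard follows; extract_between is a hypothetical helper name used only for illustration, and it assumes the URL sits between a known start marker and the next closing quote.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between start_marker and the next end_marker, or None."""
    start = text.find(start_marker)
    if start == -1:
        # Start marker is missing, so there is nothing to extract
        return None
    start += len(start_marker)  # skip past the marker itself

    end = text.find(end_marker, start)
    if end == -1:
        # No closing marker after the start position
        return None
    return text[start:end]


string = '<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>'

print(extract_between(string, 'href="', '"'))          # https://google.com
print(extract_between("no link here", 'href="', '"'))  # None
```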

edited Mar 3 at 16:23

answered Mar 3 at 16:07

EuanG


2 Comments

Justin Bertsch Mar 3 at 16:13

Thanks @EuanG. I will give this a try. I edited my initial question to make it specific to my situation instead of hypothetical. I also added in what I tried. Thanks again!

EuanG Mar 3 at 16:23

@JustinBertsch modified to follow new example

jackal · Accepted Answer · 2025-03-03 16:36:37Z

1

There are various regular expressions and functions from the re module that will achieve your objective.

Here's one:

import re

string = '<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>'

m = re.findall(r'^.*href="(.*)"\s.*$', string)

print(*m)

Output:

https://google.com
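
Note that the pattern above depends on greedy .* and on the href value being followed by whitespace, so it can miss links when a line holds more than one anchor. A sketch of a narrower pattern, assuming attribute values are always double-quoted (the two_links string is made up for illustration):

```python
import re

string = '<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>'

# [^"]* matches everything up to, but not including, the next double quote,
# so each match stops at the end of its own attribute value
pattern = re.compile(r'href="([^"]*)"')

print(pattern.findall(string))  # ['https://google.com']

# Hypothetical input with two anchors on one line
two_links = '<a href="https://a.example">x</a> <a href="https://b.example">y</a>'
print(pattern.findall(two_links))  # ['https://a.example', 'https://b.example']
```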

If you prefer not to use re then:

kw = 'href="'
start = string.find(kw) + len(kw)
end = string[start:].find('"')
result = string[start : end + start]
print(result)

...will give the same output.
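
Since the question asks about splitting specifically, the same two-step cut can also be written with str.partition(), which splits once at the first occurrence of a separator. A sketch, again assuming the href value is double-quoted:

```python
string = '<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>'

# partition() returns (before, separator, after);
# the separator element is '' when the separator was not found
_, found_href, rest = string.partition('href="')
url, found_quote, _ = rest.partition('"')

if found_href and found_quote:
    print(url)
else:
    print("no href found")
```

The split() equivalent is string.split('href="', 1)[1].split('"', 1)[0], but that raises IndexError when 'href="' is missing, whereas partition() lets you check for it first.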

edited Mar 3 at 16:36

answered Mar 3 at 16:28

jackal


furas · Accepted Answer · 2025-03-03 17:43:32Z

0

As someone suggested in a comment, you could also use a module for parsing XML or HTML,
such as lxml or BeautifulSoup. Sometimes that is the simpler method.

from bs4 import BeautifulSoup

html = '<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>'

soup = BeautifulSoup(html, 'html.parser')

hyperlink = soup.find('a').attrs['href']

print(hyperlink)  # https://google.com

#target = soup.find('a').attrs['target']
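
If you would rather not install a third-party package, the standard library's html.parser module (the one linked in the comment under the question) can do the same job. A minimal sketch; LinkCollector is a hypothetical class name chosen for this example:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names;
        # attrs is a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


html = '<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>'

collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['https://google.com']
```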

answered Mar 3 at 17:43

furas


PaulMcG · Accepted Answer · 2025-04-17 22:30:58Z

The parser expressions that pyparsing creates to match HTML tags avoid many of the classical issues with using tools like regex to parse HTML:

handles case insensitivity (of tags and tag attribute names)
handles quoted and unquoted attribute values
detects closed tags (opening tags that end with '/')
ignores embedded whitespace

In this case, we just need to search for an <a> tag, and let pyparsing grab the tag attributes, as attributes on the parsed result:

string = """<p>The brown dog jumped over the... <a href="https://google.com" target="something">... but then splashed in the water<p>"""

import pyparsing as pp

# make_html_tags returns a pair of parser expressions, one for the opening tag 
# and one for the matching closing tag - we just need the opening tag
a_tag, _ = pp.make_html_tags("a")

# search_string will return a sequence of all matches, like re.findall
anchor = a_tag.search_string(string)[0]
print(anchor.href)
# https://google.com

print(anchor.target)
# something
