0

XML: https://raw.githubusercontent.com/dp247/Freeview-EPG/master/epg.xml

Example

<tv>
  <programme channel="VirginRadio.uk" start="20250319220000 +0000" stop="20250320010000 +0000">
    <title lang="en">Olivia Jones</title>
    <desc lang="en">It may be late in the day but Olivia Jones still has a stack full of the best songs around to soundtrack your night.</desc>
    <icon src="https://images.metadata.sky.com/pd-image/e776e89e-47b9-4529-8eec-09589a7bb782/cover"/>
  </programme>
</tv>

I want to print the channel, start time, and title of matching programmes.

I came up with this:

tv / programme / title [ contains(text(), "Olivia") ] / parent::*/concat(@channel, "_", @start, "_", title)

which works at https://www.freeformatter.com/xpath-tester.html

However, it doesn't work with xq or with xmlstarlet.

Can this be done in the shell? What are the other options?

2
  • Aside: with XPath you can use brackets inside brackets so the parent::* is superfluous. Also, the text() isn't really needed: tv/programme[title[contains(.,"Olivia")]]/concat(@channel,"_",@start,"_",title) Commented Mar 18 at 11:14
  • @ClosingVotes "Using a tool" in the shell is somewhat equivalent to calling a library function in other languages. Would that be right to close a question about for eg. a numpy function just because it isn't part of Python's standard library? Commented Mar 18 at 14:54

4 Answers

1
xmlstarlet select --template \
  --match "//tv/programme[title[contains(text(),'Olivia')]]" \
  --value-of "concat(@channel,'_',@start,'_',title)" -n file.xml

Output:

VirginRadio.uk_20250319220000 +0000_Olivia Jones
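For comparison, the same "match, then format" split can be sketched in Python with only the standard library. This is a rough equivalent, not what xmlstarlet does internally: ElementTree's path language has no contains(), so the title filter is done in plain Python. The function name matching_programmes and the second programme in the sample (Other.uk / News) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# First programme is taken from the question; the second one is invented
# so that there is something that must NOT match.
SAMPLE = """<tv>
  <programme channel="VirginRadio.uk" start="20250319220000 +0000" stop="20250320010000 +0000">
    <title lang="en">Olivia Jones</title>
  </programme>
  <programme channel="Other.uk" start="20250319230000 +0000" stop="20250320000000 +0000">
    <title lang="en">News</title>
  </programme>
</tv>"""


def matching_programmes(xml_text, needle):
    """Return 'channel_start_title' for each programme whose title contains needle."""
    root = ET.fromstring(xml_text)  # root is the <tv> element
    lines = []
    for prog in root.iter("programme"):
        # findtext gives "" for a missing or empty <title>
        title = prog.findtext("title", default="")
        if needle in title:
            lines.append("_".join([prog.get("channel", ""), prog.get("start", ""), title]))
    return lines


if __name__ == "__main__":
    for line in matching_programmes(SAMPLE, "Olivia"):
        print(line)
```

The loop plays the role of --match and the join plays the role of --value-of "concat(...)".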

1

With xmllint this can be done using the --shell feature, but it needs some extra text processing:

bpath='tv/programme[title[ contains(text(), "Olivia")]]'
printf "%s\n" "cat $bpath/@channel | $bpath/@start | $bpath/title/text()" "bye" |\
xmllint --shell tmp2.xml | tr -d '"' |\
gawk 'BEGIN{RS=" channel="; FS="\n -+\n( start=)?|\n[/] > "; OFS="|"}{ if(NR > 1) print $1, $2, $3}'

Result

VirginRadio.uk|20250319220000 +0000|Olivia Jones
fm203.uk|20250219220000 +0000|Olivia Newton

Raw output from xmllint

printf "%s\n" "cat $bpath/@channel | $bpath/@start | $bpath/title/text()" "bye" | xmllint --shell tmp2.xml

/ > cat tv/programme[title[ contains(text(), "Olivia")]]/@channel | tv/programme[title[ contains(text(), "Olivia")]]/@start | tv/programme[title[ contains(text(), "Olivia")]]/title/text()
 -------
 channel="VirginRadio.uk"
 -------
 start="20250319220000 +0000"
 -------
Olivia Jones
 -------
 channel="fm203.uk"
 -------
 start="20250219220000 +0000"
 -------
Olivia Newton
/ > bye
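If the gawk one-liner is hard to follow, the same post-processing of the raw output can be sketched in Python. This is only an illustration of the idea: it assumes every match yields exactly a channel, a start and a one-line title, in that order, as in the raw output above. The names parse_xmllint_cat, RAW and ATTR are made up here.

```python
import re

# Raw xmllint --shell output, copied from the answer above.
RAW = '''/ > cat tv/programme[title[ contains(text(), "Olivia")]]/@channel | tv/programme[title[ contains(text(), "Olivia")]]/@start | tv/programme[title[ contains(text(), "Olivia")]]/title/text()
 -------
 channel="VirginRadio.uk"
 -------
 start="20250319220000 +0000"
 -------
Olivia Jones
 -------
 channel="fm203.uk"
 -------
 start="20250219220000 +0000"
 -------
Olivia Newton
/ > bye
'''

# An attribute node is printed as name="value"; a text node is printed bare.
ATTR = re.compile(r'^(\w+)="(.*)"$')


def parse_xmllint_cat(raw):
    """Turn the raw 'cat' output into (channel, start, title) tuples."""
    values = []
    for line in raw.splitlines():
        text = line.strip()
        # Skip blanks, shell prompts ("/ > ...") and the dashed separators.
        if not text or text.startswith("/ >") or set(text) == {"-"}:
            continue
        m = ATTR.match(text)
        values.append(m.group(2) if m else text)
    # Nodes come out in document order, so every three values form one record;
    # an incomplete trailing group is dropped.
    return [tuple(values[i:i + 3]) for i in range(0, len(values) - 2, 3)]


if __name__ == "__main__":
    for record in parse_xmllint_cat(RAW):
        print("|".join(record))
```

A title made up only of dashes, or one starting with "/ >", would be misread as a separator or prompt, so treat this as a sketch rather than a robust parser.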

3 Comments

This seems to return only the first match.
yes, I will fix that later.
@RichardBarraclough fixed.
0

xq (from kislyuk/yq) uses jq under the hood, so descend into .tv.programme[] (use --xml-force-list programme to make sure .programme is an array, even if there's just one child item), then select by condition .title."#text" | contains($q) (with $q being your search query defined with --arg q "Olivia"), and compose the output by concatenating ."@channel", ."@start", and .title."#text", with an underscore joined in between. The -r flag decodes the result into a raw string.

xq --arg q "Olivia" -r --xml-force-list programme '
  .tv.programme[] | select(.title."#text" | contains($q))
  | [."@channel", ."@start", .title."#text"] | join("_")
'

The same can be achieved with yq (from mikefarah/yq) after a few small tweaks to the jq approach above: import values (the query) through the environment and retrieve them with strenv; text nodes are encoded as +content (instead of #text), and attribute names get an additional + before the @. Then make .programme iterable manually by handling both alternatives (using the alternative operator //), and, if needed (depending on the file extensions used), explicitly define the input and output formats (with the -px and -roy flags).

q="Olivia" yq -px -roy '
  .tv.programme | (select(type == "!!seq") | .[]) // .
  | select(.title.+content | contains(strenv(q)))
  | [.+@channel, .+@start, .title.+content] | join("_")
'

Output from both using the input sample:

VirginRadio.uk_20250319220000 +0000_Olivia Jones
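To make the "one child vs. many children" point concrete, here is a small Python sketch of what both filters do, run on a hand-written dict that mimics the shape xq hands to jq ("@..." keys for attributes, "#text" for text nodes). The dicts and the names as_list and matching_lines are my own illustration, not output captured from either tool, and the second programme in DOC_MANY is invented.

```python
# A single <programme> child shows up as one object ...
DOC_SINGLE = {
    "tv": {
        "programme": {
            "@channel": "VirginRadio.uk",
            "@start": "20250319220000 +0000",
            "title": {"@lang": "en", "#text": "Olivia Jones"},
        }
    }
}

# ... while several children show up as a list of objects.
DOC_MANY = {
    "tv": {
        "programme": [
            DOC_SINGLE["tv"]["programme"],
            {
                "@channel": "Other.uk",  # invented entry
                "@start": "20250319230000 +0000",
                "title": {"@lang": "en", "#text": "News"},
            },
        ]
    }
}


def as_list(value):
    """Same purpose as --xml-force-list / the '// .' fallback: always iterate."""
    return value if isinstance(value, list) else [value]


def matching_lines(doc, query):
    """Select programmes whose title text contains query, then join the fields."""
    lines = []
    for prog in as_list(doc["tv"]["programme"]):
        title = prog["title"]["#text"]
        if query in title:
            lines.append("_".join([prog["@channel"], prog["@start"], title]))
    return lines


if __name__ == "__main__":
    print(matching_lines(DOC_SINGLE, "Olivia"))
    print(matching_lines(DOC_MANY, "Olivia"))
```

Without the as_list step, iterating over DOC_SINGLE would walk the keys of the single object instead of a list of programmes, which is the failure mode that forcing a list guards against.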

2 Comments

I'm on debian so I don't have that xq. The xq on Debian is github.com/sibprogrammer/xq
@Richard As mentioned in my answer (and also explained in the tag info of the xq tag, which you have used), xq is packaged with (kislyuk's) yq. On Debian, this is sources.debian.org/src/yq
-1

I see an answer has already been found. As an addition, here is one with Python:

import requests
import xml.etree.ElementTree as ET
import io
import pandas as pd

url = "https://raw.githubusercontent.com/dp247/Freeview-EPG/master/epg.xml"
response = requests.get(url, stream=True)

xml_stream = io.BytesIO(response.content)
context = ET.iterparse(xml_stream, events=("start", "end"))
data = []  # List to store extracted data

for event, elem in context:
    if event == "start" and elem.tag == "programme":
        channel = elem.attrib.get("channel", "")
        start_time = elem.attrib.get("start", "")
    elif event == "end" and elem.tag == "title":
        if elem.text and "Olivia" in elem.text:  # elem.text is None for an empty <title/>
            data.append({"Title": elem.text, "Channel": channel, "Time": start_time})
    elif event == "end" and elem.tag == "programme":
        elem.clear()  # Free memory

df = pd.DataFrame(data)
print(df.to_string())
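A possible variation, in case you want to avoid tracking state across events: listen only for "end" events and read everything from the finished programme element. This is a sketch with the standard library only (no requests or pandas); the generator name iter_matches and the inline sample are made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_matches(source, needle):
    """Yield (channel, start, title) for programmes whose title contains needle.

    source is a filename or a binary file-like object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "programme":
            continue
        # At the "end" event the element is complete, so attributes and
        # children can be read together.
        title = elem.findtext("title", default="")
        if needle in title:
            yield elem.get("channel", ""), elem.get("start", ""), title
        elem.clear()  # drop the processed programme's children and attributes


if __name__ == "__main__":
    sample = io.BytesIO(
        b'<tv><programme channel="VirginRadio.uk" start="20250319220000 +0000">'
        b'<title lang="en">Olivia Jones</title></programme></tv>'
    )
    for channel, start, title in iter_matches(sample, "Olivia"):
        print(channel, start, title, sep="_")
```

For the real feed you could pass a downloaded file name, or wrap the downloaded bytes in io.BytesIO as the code above does. Note that the root element still keeps references to the (now empty) programme elements, so memory use is reduced but not constant.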
