This is useful because I can then write, for example:
xPath->query('//div.class');

So I need a regex that performs these transformations:

Example 1
text().some_class => text()[contains(concat(" ", @class, " "), " some_class ")]
Example 2: nothing to do, because the dot is inside a quoted string
@src = 'obr.gif' => @src = 'obr.gif'
Example 3
*.class => *[contains(concat(" ", @class, " "), " class ")]
Example 4
div.class => div[contains(concat(" ", @class, " "), " class ")]
Example 5: do nothing, because there is no subject that should have this class (I know this is not valid XPath)
div[.neco] => div[.neco]

I used PHP's preg_replace this way:

preg_replace(
        '/\.([a-z_][\w-]*)/i',
        '[contains(concat(" ", @class, " "), " $1 ")]',
        $xPath);

That only worked for examples No. 1, 3 and 4. So I updated it:

preg_replace(
        '/(?<=[\w*\])])\.([a-z_][\w-]*)/i',
        '[contains(concat(" ", @class, " "), " $1 ")]',
        $xPath);

Then only example No. 2 failed. I tried this:

preg_replace(
        '/(\'[^\']+\'.*?)*(?<=[\w*\])])\.([a-z_][\w-]*)/i',
        '$1[contains(concat(" ", @class, " "), " $2 ")]',
        $xPath);

That works for:
//div[@src = 'obr.gif'].class => //div[@src = 'obr.gif'][contains(concat(" ", @class, " "), " class ")]
But for example No. 2 it gives the wrong result:
@src = 'obr.gif' => @src = 'obr[contains(concat(" ", @class, " "), " gif ")]'
I realize that PHP tries hard to match at least something, so it "ignores" the first group in parentheses, but I don't know how to write a regex that does what I want.
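
To see why the third attempt still rewrites example No. 2, here is a minimal reproduction. It is a sketch in JavaScript rather than PHP (an assumption: an engine with lookbehind support, such as a recent Node.js), and the variable names are made up for illustration. Because the group in front is quantified with *, it may match zero times, so the engine is free to begin the match at the dot inside the quoted string.

```javascript
// Third attempt from the question, ported to JavaScript (needs lookbehind support).
const attempt3 = /('[^']+'.*?)*(?<=[\w*\])])\.([a-z_][\w-]*)/i;
const input = "@src = 'obr.gif'";

const m = input.match(attempt3);
// The group ('[^']+'.*?)* is optional, so the engine can skip it entirely
// and start the match at the dot inside the quoted string.
console.log(m.index); // index of the "." inside 'obr.gif'
console.log(m[1]);    // undefined: the quoted-string group took no part in the match
console.log(m[2]);    // "gif"

// With group 1 empty, "$1" contributes nothing and the dot is rewritten anyway.
const rewritten = input.replace(
  attempt3,
  '$1[contains(concat(" ", @class, " "), " $2 ")]'
);
console.log(rewritten);
```

The quoted-string group only helps when a class dot follows the string; nothing forces the engine to consume a quoted string that has no class after it.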

PS: I only use single quotes in my XPath expressions, so I do not need to handle double quotes.

EDIT: funkwurm's answer, modified for PHP:

preg_replace_callback(<<<'CLASS'
        /('|").*?(?<!\\)\1|(?<=[\w*\])])\.([a-z_][\w-]*)/i
CLASS
        , function($matches) {
            return $matches[1] ? $matches[0] : "[contains(concat(\" \", @class, \" \"), \" $matches[2] \")]";
        },
        $xPath
);

I'm using nowdoc syntax for the regex, so I don't have to deal with escaping inside a quoted string.

1 Answer

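The best approach here is the "match this, unless condition A|B" technique: let the earlier alternatives match the things you want to leave alone (here, quoted strings), and capture what you actually want only in the last alternative.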

I would make the regex like so:

('|")(?:(?!\\|\1).|\\.)*\1|([\w*\])])\.([a-z_][\w-]*)

In your programming language you then check whether the class capture group (the last one) has any content. If it does, the match is a class selector and you do your existing substitution. Otherwise you leave the text alone, which means replacing the match with itself. A JavaScript implementation is below. Note that the callback receives the whole match m, the quote capture group q, the character just before the . as e, and the class capture group c. If c is undefined I return the entire match m; otherwise I do the substitution.

var xpaths = [
  'text().some_class',            // => text()[contains(concat(" ", @class, " "), " some_class ")]
  '@src = \'obr.gif\'',           // => @src = 'obr.gif'
  '*.class',                      // => *[contains(concat(" ", @class, " "), " class ")]
  'div.class',                    // => div[contains(concat(" ", @class, " "), " class ")]
  'div[.neco]',                   // => div[.neco]
  'div[@src = \'obr.gif\'].class',// => div[@src = 'obr.gif'][contains(concat(" ", @class, " "), " class ")]
  'div[.//img.class]'             // => div[.//img[contains(concat(" ", @class, " "), " class ")]]
];

document.getElementById('out').value=xpaths.map(function(str) {
  return str.replace(/('|")(?:(?!\\|\1).|\\.)*\1|([\w*\])])\.([a-z_][\w-]*)/ig, function(m, q, e, c) {
    return (c==undefined)?m:(e+'[contains(concat(" ", @class, " "), " ' + c + ' ")]');
  });
}).join('\n');
<textarea id="out" rows="10" style="width:100%"></textarea>
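
The snippet above writes into a textarea, so it only runs in a browser. As a sketch for other environments (Node.js, for example), the same regex and callback can be wrapped in a plain function. The names CLASS_SHORTHAND and expandClassShorthand are hypothetical, chosen here for illustration; the pattern and the replacement text are the ones from this answer.

```javascript
// Hypothetical names; the pattern itself is the one given in the answer.
// Alternative 1 matches a whole quoted string, alternative 2 matches
// "<preceding char>.<class>" and captures both parts.
const CLASS_SHORTHAND = /('|")(?:(?!\\|\1).|\\.)*\1|([\w*\])])\.([a-z_][\w-]*)/gi;

function expandClassShorthand(xpath) {
  return xpath.replace(CLASS_SHORTHAND, (match, quote, before, cls) =>
    cls === undefined
      ? match // a quoted string matched: put it back unchanged
      : `${before}[contains(concat(" ", @class, " "), " ${cls} ")]`
  );
}

console.log(expandClassShorthand("div[@src = 'obr.gif'].class"));
console.log(expandClassShorthand("div[.neco]"));
```

One limitation of consuming the preceding character instead of using a lookbehind: in a chained shorthand such as div.a.b, the a is consumed as part of the first match, so only the first class is expanded.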

3 Comments

Thank you for fixing the class-matching rule (the underscore). :-) But I don't quite understand the \[[^\]]*\] alternative, because it fails on the XPath expression div[.//img.class], where the class on the image isn't handled.
Thank you. I was able to modify your regex to handle the previous case and to simplify it slightly with a look-behind. I removed the \[[^\]]*\] part and added back (?<=[\w*\])]) instead, which resolves the previous case. I simplified the part (['"])(?:(?!\\|\1).|\\.)*\1 into ('|").*?(?<!\\)\1, which is shorter and thus easier to understand (I hope it's nearly equivalent). Your solution deals with double quotes and escaped quotes, good. :-)
Ah, I thought that anything between [] could be assumed not to be a class. But checking whether it's preceded by [\w*\])] works too; I changed my answer so that it works in JavaScript without lookbehind. If my answer helped you, could you click "accept"? It helps us both and the site :)
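
On the "nearly equivalent" question in the comments: the two quoted-string sub-patterns can differ when a string ends in an escaped backslash. The sketch below uses a made-up input to show it, and it needs lookbehind support for the simplified pattern. XPath 1.0 string literals have no backslash escapes, so for expressions like the ones in the question the difference may never show up.

```javascript
// Sub-pattern from the answer: a backslash always consumes the character after it.
const strict = /('|")(?:(?!\\|\1).|\\.)*\1/;
// Simplified sub-pattern from the comment: the closing quote only has to
// not be directly preceded by a backslash.
const simplified = /('|").*?(?<!\\)\1/;

// Hypothetical input: the first string ends in an escaped backslash (\\),
// and a second quoted string follows.
const sample = String.raw`'a\\' and 'b'`;

const strictMatch = sample.match(strict)[0];
const simplifiedMatch = sample.match(simplified)[0];

console.log(strictMatch);     // stops at the quote right after the escaped backslash
console.log(simplifiedMatch); // skips that quote and runs on to the next one
```

The simplified form treats the quote after \\ as escaped, so it swallows the text up to the opening quote of the next string and leaves the rest of the expression out of step.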
