1

I would like to split a string containing HTML snippets into individual characters and HTML elements.

let str = 'wel<span class="t">come</span> to all'
console.log("split attempt " +JSON.stringify(str.split(/(<.*?>)|/)));

giving me:

split attempt ["w",null,"e",null,"l","<span class=\"t\">","c",null,"o",null,"m",null,"e","</span>"," ",null,"t",null,"o",null," ",null,"a",null,"l",null,"l"]

By filtering out the nulls, I get what I want:

split attempt ["w","e","l","<span class=\"t\">","c","o","m","e","</span>"," ","t","o"," ","a","l","l"]
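
For reference, the split-then-filter step described above could be sketched as follows. The function name `splitAndFilter` is only illustrative, and `filter(Boolean)` is one possible filter: it drops the `undefined` capture slots (printed as `null` by `JSON.stringify`) and also any empty strings that appear when a tag sits at the very start of the input.

```javascript
// Hypothetical helper: split on tags or on every character boundary,
// then drop the undefined capture slots and empty strings.
function splitAndFilter(input) {
  return input.split(/(<.*?>)|/).filter(Boolean);
}

const sample = 'wel<span class="t">come</span> to all';
console.log(JSON.stringify(splitAndFilter(sample)));
```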

But is there a way, within the regular expression itself, to keep specific sequences (like short HTML tags) intact and split the rest character by character?

7
  • 1
    What's the ultimate goal for this? Where is your original string coming from? Commented May 19 at 19:15
  • 3
    If you have HTML then you need an HTML parser. Regex cannot reliably work as an HTML parser: there are some explanations in the answers to RegEx match open tags except XHTML self-contained tags. Commented May 19 at 19:23
  • You can use: str.split(/(<[^>]+>|)/).filter(Boolean) but keep in mind that this will work for simple non-nested tags only. Commented May 19 at 19:36
  • @AndrewMorton I'm well aware that regex is not ideal for HTML parsing, but in this case it's about small HTML elements. If it makes you feel better, you can mentally replace the < and > with other characters, like Kspan class="t"D, and mentally ignore the problem that the capital K and D cannot both be present. Commented May 19 at 19:41
  • 1
    You could also use match, as @trincot suggests, with <[^>]*>|[^><]. Negation can be more efficient, if this works with your data. Commented May 19 at 20:49

4 Answers

2

It seems you want to split into individual characters, except for <...> tags, which you want to treat as atomic.

To avoid the empty strings and nulls, I'd suggest using match instead of split:

const str = 'wel<span class="t">come</span> to all';
console.log(str.match(/<.*?>|./sg));
 

Disclaimer: for any more complex HTML parsing you should use an HTML parser. Think of HTML comments, CDATA blocks, <script> tags with code that has tags in string literals, etc.
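
Two details of the `match` approach may be worth a sketch. `match` returns `null` rather than an empty array when nothing matches (for example on an empty string), and without the `u` flag the `.` splits characters outside the Basic Multilingual Plane into two surrogate halves. The helper name `splitCharsAndTags` and the added `u` flag are assumptions for illustration, not part of the answer above.

```javascript
// Hypothetical wrapper around the match-based approach.
// - `s` lets `.` match newlines, `u` makes `.` match whole code points.
// - `?? []` turns the null returned for "no match" into an empty array.
function splitCharsAndTags(input) {
  return input.match(/<.*?>|./gsu) ?? [];
}

console.log(splitCharsAndTags('wel<span class="t">come</span> to all'));
console.log(splitCharsAndTags(''));
```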


2

The best HTML parser is the browser itself!

let str = `wel<span class="t">come</span> to all 
<div>
Some Div content with <a> A TAG</a> and <div>inner DIV</div>
</div>
`

const el = document.createElement('div');
el.innerHTML = str;
console.log(el.innerText); // text content only, tags removed


// Regexp for an array of characters and tags

const m = str.match(/[^<]|<[^>]*>/g);
console.log(m);

Result:

["w","e","l","<span class=\"t\">","c","o","m","e","</span>"," ","t","o"," ","a","l","l"," ","\n","<div>","\n","S","o","m","e"," ","D","i","v"," ","c","o","n","t","e","n","t"," ","w","i","t","h"," ","<a>"," ","A"," ","T","A","G","</a>"," ","a","n","d"," ","<div>","i","n","n","e","r"," ","D","I","V","</div>","\n","</div>","\n"]
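
One caveat with a pattern of this shape: a `<` that never gets a closing `>` matches neither alternative, so it is silently dropped from the result. The sketch below contrasts that with a variant that falls back to emitting any single character. Both function names and the fallback pattern `<[^<>]*>|[\s\S]` are assumptions for illustration.

```javascript
// The pattern from the answer: a stray "<" without ">" is skipped.
function splitLossy(input) {
  return input.match(/[^<]|<[^>]*>/g) ?? [];
}

// Hypothetical variant: try a tag first, otherwise take any one character,
// so an unmatched "<" is kept as an ordinary character.
function splitTolerant(input) {
  return input.match(/<[^<>]*>|[\s\S]/g) ?? [];
}

console.log(splitLossy('a < b'));    // the "<" is missing from the result
console.log(splitTolerant('a < b')); // the "<" is kept
```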

3 Comments

Missing the tags - you can loop over nodes to get both
Yes, looping over nodes/elements is not so easy. But removing all the tags, why not?
OP: I would like to split a string containing html snippets in characters and the elements.
1

To elaborate on the suggestion to parse it as HTML:

Regex is bad for parsing HTML

First of all, you should not use regex to parse HTML; it's just a bad idea (see the other SO post: Using regular expressions to parse HTML: why not?).

Instead, I would take this approach: parse the string as HTML. The top-level nodes will be either text nodes (which we split into single characters) or ordinary HTML elements (which we leave as is).

The implementation would be something like this:

function splitHtmlIntoCharsAndElements(htmlString) {
    const template = document.createElement('template');
    template.innerHTML = htmlString;

    const result = [];

    function checkAndParseNode(node) {
        if (node.nodeType === Node.TEXT_NODE) {
            // Split plain text into individual characters
            result.push(...node.textContent.split(''));
        } else if (node.nodeType === Node.ELEMENT_NODE) {
            // If it's an element, push the whole element (including its children)
            result.push(node.outerHTML); // get string representation of a node
        }
    }

    // Top-level DOM nodes (text or elements)
    template.content.childNodes.forEach(checkAndParseNode);
    return result;
}

// and you could use it like below

let str = 'wel<span class="t">come</span> to all';
let parts = splitHtmlIntoCharsAndElements(str);

console.log(parts);

0

Here is a small routine that achieves the desired result, even if you have nested tags.

let str = "wel<<>> popo <span class=\"t\">come</span> to all"
let label = false;   // true while we are inside a tag
let out = "";
let aux = "";
let nested = 0;

for( const letter of str ) {
   if( ! label ) {
        // if "letter" is not "<", we add it to the output; otherwise, 
        // we set "label" to "true", increase "nested" and start 
        // "aux" with an opening quote and the "<".
      if( letter !== "<" ) {
         out += "\"" + letter + "\",";
      }
      else {
         nested ++;
         aux = "\"<";
         label = true;
      }
   }
   else {
        // add "letter" to "aux"
      aux += letter;

        // if we find a "<" here again, it means that we found a nested tag, 
        // so we increase "nested".
        // if we find a ">" and "nested" is greater than "1", 
        // it is the closing of an inner tag, so we 
        // only decrease "nested". 
        // if instead "nested" is "1", it is the closing of the outer 
        // tag, so we decrease "nested", set "label" to "false", add the 
        // content of "aux" to "out", and then reset it.
      if( letter === "<" ) nested++;
      else if( letter === ">" && nested > 1 ) nested--;
      else if( letter === ">" && nested === 1 ) {
         label = false;
         out += aux + "\",";
         aux = "";
         nested --;
      }
   }
}
console.log( out );
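
The same scan could also be sketched as a function that returns an array instead of building a quoted string, which avoids the manual quoting and the trailing comma. The name `splitWithNesting` and the handling of an unterminated tag (its characters are emitted one by one) are assumptions, not part of the code above.

```javascript
// Hypothetical array-returning variant of the character scan.
function splitWithNesting(input) {
  const parts = [];
  let tag = "";   // characters of the tag currently being collected
  let depth = 0;  // number of "<" not yet closed by a ">"

  for (const ch of input) {
    if (depth === 0) {
      if (ch === "<") {
        depth = 1;
        tag = ch;
      } else {
        parts.push(ch);
      }
    } else {
      tag += ch;
      if (ch === "<") {
        depth++;
      } else if (ch === ">" && --depth === 0) {
        parts.push(tag); // outermost tag closed: emit it as one item
        tag = "";
      }
    }
  }

  // Assumption: an unterminated tag is emitted character by character.
  if (tag) parts.push(...tag);
  return parts;
}

console.log(splitWithNesting('wel<<>> popo <span class="t">come</span> to all'));
```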
