Is there a regular expression which matches a single grapheme cluster?























Graphemes are the user-perceived characters of a text, which in Unicode may consist of several code points.



From Unicode® Standard Annex #29:




It is important to recognize that what the user thinks of as a
“character”—a basic unit of a writing system for a language—may not be
just a single Unicode code point. Instead, that basic unit may be made
up of multiple Unicode code points. To avoid ambiguity with the
computer use of the term character, this is called a user-perceived
character. For example, “G” + grave-accent is a user-perceived
character: users think of it as a single character, yet is actually
represented by two Unicode code points. These user-perceived
characters are approximated by what is called a grapheme cluster,
which can be determined programmatically.




Is there a regex I can use (in JavaScript) that will match a single grapheme cluster? For example:



"한bar".match(/*?*/)[0] === "한"
"நிbaz".match(/*?*/)[0] === "நி"
"aa".match(/*?*/)[0] === "a"
"\r\n".match(/*?*/)[0] === "\r\n"
"💆‍♂️foo".match(/*?*/)[0] === "💆‍♂️"





























  • You can use Unicode escape sequences to match single UTF-16 code units or surrogate pairs, but matching a complete grapheme cluster would be really complicated, I think. There's certainly no built-in way to do it; JavaScript regular expressions are woefully inadequate for general Unicode patterns like that.
    – Pointy
    Nov 7 at 21:56










  • Please take a look at the marked question and, if it doesn't answer your problem, edit your question accordingly.
    – revo
    Nov 7 at 22:40






  • @revo I don't think this question is a duplicate of the marked question. Firstly, the marked question is very unclear in what it asks (it starts by asking how to replace emojis in a string with their emoticon equivalents, and shifts to asking how to replace emojis with their Unicode escape sequence equivalents). Secondly, that question specifically asks about emojis, and neither it nor any of its answers address other types of grapheme clusters besides emojis (most emojis are represented by a single code point). How do I appeal this decision?
    – brainkim
    Nov 8 at 0:26








  • Perl-style regular expressions use \X to match one cluster, but unfortunately that doesn't seem to have been carried over to the JavaScript flavor of REs... If they support matching code points with given Unicode properties, you might be able to convert the EGC grammar in the Unicode spec to a RE, though.
    – Shawn
    Nov 8 at 6:04










  • \X matches any character (regardless of number of bytes e.g. a) as well as a whole grapheme cluster as one match. It works almost the same way as \P{M}\p{M}*, which JavaScript supports through Unicode property escapes with the u flag (standardized in ES2018) and which could be transpiled to ES5 (for this you could use this tool). The difference between the two is that \X has some rules for Hangul syllables (which you used in your examples): it doesn't break inside them. So you have to match them separately. See this for more insights on Hangul.
    – revo
    Nov 8 at 11:36

















javascript regex unicode














edited Nov 8 at 21:32

























asked Nov 7 at 21:52









brainkim









1 Answer













Full, easy-to-use integrated support: no. Approximations for various matching tasks: yes. From regex tutorial:




Matching a single grapheme, whether it's encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, Boost, Ruby 2.0, Java 9, and the Just Great Software applications: simply use \X. You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.



In .NET, Java 8 and prior, and Ruby 1.9 you can use \P{M}\p{M}*+ or (?>\P{M}\p{M}*) as a reasonably close substitute. To match any number of graphemes, use (?>\P{M}\p{M}*)+ as a substitute for \X+.




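\X is the closest, and does not exist in any version of JavaScript through ES6. \P{M}\p{M}*+ approximates \X, but does not exist in that form: if you have Unicode property escapes (standardized in ES2018, used with the u flag) natively or via transpilation, you can use /(\P{Mark})(\p{Mark}+)/gu to match a base code point followed by one or more combining marks; use * instead of + to also match code points that carry no marks.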



But even still, that isn't sufficient. <== Read that link for all the gory details.
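To make the approximation and its gaps concrete, here is a minimal sketch. It assumes an engine with Unicode property escapes, and the helper name firstApproxCluster is made up for illustration. It uses the * variant so that unaccented characters also match.

```javascript
// Approximation of \X: one non-mark code point followed by any combining marks.
// Needs the `u` flag and Unicode property escapes.
const approxCluster = /\P{Mark}\p{Mark}*/u;

// Hypothetical helper: first approximate cluster of `text`, or "" if there is none.
function firstApproxCluster(text) {
  const match = text.match(approxCluster);
  return match ? match[0] : "";
}

// Works for a base letter plus combining marks (e + U+0301 COMBINING ACUTE ACCENT).
const accented = firstApproxCluster("e\u0301x");

// Falls short elsewhere: ZWJ (U+200D) and conjoining Hangul jamo are not marks,
// and CR LF is two non-mark code points, so each of these is cut after one code point.
const emoji = firstApproxCluster("\u{1F486}\u200D\u2642\uFE0Ffoo");
const jamo = firstApproxCluster("\u1112\u1161\u11ABbar");
const crlf = firstApproxCluster("\r\n");

console.log([accented, emoji, jamo, crlf].map((s) => [...s].length));
```

The emoji, decomposed Hangul, and CRLF cases from the question are exactly the ones this pattern splits, which is why it can only be called an approximation.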



A proposal to segment text has been put forward, but it's not yet adopted. If you're dedicated to Chrome, you can use its non-standard Intl.v8BreakIterator to break clusters and match manually.
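For readers on a newer runtime: the segmentation work referred to above is, as far as I know, what later shipped as Intl.Segmenter. The sketch below assumes that API is available in your engine (it was not when this answer was written); the helper name firstGrapheme is hypothetical. It is not a regex, but it returns what the question's examples expect.

```javascript
// Assumption: the runtime implements Intl.Segmenter with grapheme granularity.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical helper: first grapheme cluster of `text`, or "" for an empty string.
function firstGrapheme(text) {
  const first = graphemeSegmenter.segment(text)[Symbol.iterator]().next();
  return first.done ? "" : first.value.segment;
}

// Inputs modeled on the question's examples, written with escapes for clarity.
const samples = [
  "\u{1F486}\u200D\u2642\uFE0Ffoo", // emoji ZWJ sequence
  "\u1112\u1161\u11ABbar",          // decomposed Hangul syllable
  "\u0BA8\u0BBFbaz",                // Tamil consonant + vowel sign
  "\r\n",                           // CR LF
];

for (const sample of samples) {
  console.log(JSON.stringify(firstGrapheme(sample)));
}
```

Because the segmenter follows the platform's Unicode data, the results can shift slightly between engine versions as the grapheme rules are revised.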














        edited Nov 8 at 14:55

























        answered Nov 8 at 14:25









        bishop































             
