Is there a regular expression which matches a single grapheme cluster?

Graphemes are the user-perceived characters of a text, which in Unicode may consist of several code points.

From Unicode® Standard Annex #29:

It is important to recognize that what the user thinks of as a
“character”—a basic unit of a writing system for a language—may not be
just a single Unicode code point. Instead, that basic unit may be made
up of multiple Unicode code points. To avoid ambiguity with the
computer use of the term character, this is called a user-perceived
character. For example, “G” + grave-accent is a user-perceived
character: users think of it as a single character, yet is actually
represented by two Unicode code points. These user-perceived
characters are approximated by what is called a grapheme cluster,
which can be determined programmatically.

Is there a regex I can use (in JavaScript) which will match a single grapheme cluster? For example:

"한bar".match(/*?*/)[0] === "한"

"நிbaz".match(/*?*/)[0] === "நி"

"aa".match(/*?*/)[0] === "a"

"\r\n".match(/*?*/)[0] === "\r\n"

"💆‍♂️foo".match(/*?*/)[0] === "💆‍♂️"

edited Nov 8 at 21:32

asked Nov 7 at 21:52

brainkim

3072413

You can use Unicode escape sequences to match single UTF-16 code units or surrogate pairs, but matching a complete grapheme cluster would be really complicated, I think. There's certainly no built-in way to do it; JavaScript regular expressions are woefully inadequate for general Unicode patterns like that.
– Pointy
Nov 7 at 21:56

Please take a look at the marked question and if it didn't answer your problem edit accordingly.
– revo
Nov 7 at 22:40

@revo I don't think this question is a duplicate of the marked question. Firstly, the marked question is incredibly unclear in what it asks (it starts by asking how to replace emojis in a string with their emoticon equivalents, and shifts to asking how to replace emojis with their Unicode escape sequence equivalents). Secondly, it specifically asks about emojis, and neither the question nor any of its answers address other types of grapheme clusters besides emojis (most emojis are represented by a single code point). How do I appeal this decision?
– brainkim
Nov 8 at 0:26

Perl-style regular expressions use \X to match one cluster, but unfortunately that doesn't seem to have been adopted by the JavaScript flavor of REs... If they support matching code points with given Unicode properties, you might be able to convert the extended grapheme cluster (EGC) grammar in the Unicode spec to a RE, though.
– Shawn
Nov 8 at 6:04

\X matches any character (regardless of the number of bytes, e.g. a) along with grapheme clusters as one match. It works almost the same way as \P{M}\p{M}*, which is supported natively since ES2018 (with the u flag) and could be transpiled to ES5 (for this you could use this tool). But there is a difference between the two: \X has rules for Hangul syllables (which you used in your examples) so that it doesn't break inside them. So you have to match them separately. See this for more insights on Hangul.
– revo
Nov 8 at 11:36

javascript regex unicode

1 Answer

Full, easy-to-use integrated support: no. Approximations for various matching tasks: yes. From regex tutorial:

Matching a single grapheme, whether it's encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, Boost, Ruby 2.0, Java 9, and the Just Great Software applications: simply use \X. You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.

In .NET, Java 8 and prior, and Ruby 1.9 you can use \P{M}\p{M}*+ or (?>\P{M}\p{M}*) as a reasonably close substitute. To match any number of graphemes, use (?>\P{M}\p{M}*)+ as a substitute for \X+.

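\X is the closest fit, but it does not exist in any ECMAScript version. \P{M}\p{M}*+ approximates \X, but does not exist in that form either: with Unicode property escapes (standardized in ES2018 and requiring the u flag, natively or via transpilation), you can use /(\P{Mark})(\p{Mark}+)/gu. Note that this pattern requires at least one mark, so it only matches a base code point with combining marks attached; use * in place of + to also match a bare code point.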
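As a rough illustration of the approximation above, here is a minimal sketch using a property-escape regex with the u flag. The helper name firstApprox is hypothetical, and the sketch uses * rather than + so that bare code points also match. It shows where the approximation agrees with a grapheme cluster and where it falls short (CRLF, decomposed Hangul jamo, and ZWJ emoji sequences).

```javascript
// Approximation of \X: one non-mark code point followed by any number of marks.
// Requires an engine with Unicode property escapes (ES2018) and the `u` flag.
const approxGrapheme = /\P{M}\p{M}*/u;

// Hypothetical helper: returns the first approximate "grapheme", or null.
function firstApprox(text) {
  const m = text.match(approxGrapheme);
  return m ? m[0] : null;
}

// Works: base letter plus combining acute accent (U+0301).
console.log(firstApprox("e\u0301bar") === "e\u0301"); // true

// Falls short: CRLF is one grapheme cluster, but only "\r" is matched.
console.log(firstApprox("\r\n") === "\r"); // true

// Falls short: decomposed Hangul jamo are letters, not marks,
// so only the leading consonant (U+1112) is matched.
console.log(firstApprox("\u1112\u1161\u11ABbar") === "\u1112"); // true

// Falls short: ZERO WIDTH JOINER (U+200D) is not a mark,
// so the emoji ZWJ sequence is cut after the first code point.
console.log(firstApprox("\u{1F486}\u200D\u2642\uFE0Ffoo") === "\u{1F486}"); // true
```

The last three cases are the kind of gap a mark-based pattern cannot close on its own, since the rules involved depend on properties other than the Mark general category.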

But even still, that isn't sufficient. <== Read that link for all the gory details.

A proposal to segment text has been put forward, but it's not yet adopted. If you're dedicated to Chrome, you can use its non-standard Intl.v8BreakIterator to break clusters and match manually.
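For readers on a newer runtime, here is a hedged sketch of segmenting text without a regex. It assumes the proposal mentioned above corresponds to Intl.Segmenter, which the answer does not name, and that the runtime ships it (recent browsers and Node.js do). The helper name firstGrapheme and the default locale "en" are illustrative choices, not part of the original answer.

```javascript
// Assumption: Intl.Segmenter is available in the runtime.
// Hypothetical helper: returns the first grapheme cluster of `text`, or null.
function firstGrapheme(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  // segment() returns an iterable of { segment, index, input } objects.
  for (const { segment } of segmenter.segment(text)) {
    return segment;
  }
  return null; // empty input has no segments
}

console.log(firstGrapheme("\uD55Cbar") === "\uD55C"); // precomposed Hangul syllable
console.log(firstGrapheme("\r\n") === "\r\n"); // CRLF kept together
console.log(firstGrapheme("\u{1F486}\u200D\u2642\uFE0Ffoo") ===
  "\u{1F486}\u200D\u2642\uFE0F"); // emoji ZWJ sequence kept together
```

Spreading the result, as in [...segmenter.segment(text)], yields every cluster with its index, which can stand in for a global match when the clusters need to be processed one by one.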

edited Nov 8 at 14:55

answered Nov 8 at 14:25

bishop

23k46087
