Is there a regular expression which matches a single grapheme cluster?
Graphemes are the user-perceived characters of a text, which in Unicode may comprise several code points.
From Unicode® Standard Annex #29:
It is important to recognize that what the user thinks of as a
“character”—a basic unit of a writing system for a language—may not be
just a single Unicode code point. Instead, that basic unit may be made
up of multiple Unicode code points. To avoid ambiguity with the
computer use of the term character, this is called a user-perceived
character. For example, “G” + grave-accent is a user-perceived
character: users think of it as a single character, yet is actually
represented by two Unicode code points. These user-perceived
characters are approximated by what is called a grapheme cluster,
which can be determined programmatically.
Is there a regex I can use (in JavaScript) that will match a single grapheme cluster? For example:
"한bar".match(/*?*/)[0] === "한"
"நிbaz".match(/*?*/)[0] === "நி"
"aa".match(/*?*/)[0] === "a"
"\r\n".match(/*?*/)[0] === "\r\n"
"💆♂️foo".match(/*?*/)[0] === "💆♂️"
javascript regex unicode
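To make the gap concrete, here is a small sketch of what the built-in tools report for two such clusters. The strings are written with escapes so the code points are unambiguous; the names gGrave, tamilNi, codePointCount and firstByDot are only illustrative.

```javascript
// "G" followed by COMBINING GRAVE ACCENT: the example from UAX #29.
const gGrave = "G\u0300";
// TAMIL LETTER NA + TAMIL VOWEL SIGN I: one user-perceived character.
const tamilNi = "\u0BA8\u0BBF";

// Count code points (not UTF-16 code units) by spreading the string.
const codePointCount = (text) => [...text].length;

console.log(codePointCount(gGrave));  // 2
console.log(codePointCount(tamilNi)); // 2

// Even with the u flag, the dot matches one code point, not one cluster.
const firstByDot = (tamilNi + "baz").match(/./u)[0];
console.log(firstByDot === "\u0BA8"); // true
```

So neither string length nor a plain dot lines up with what a reader would call one character.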
You can use Unicode escape sequences to match single UTF-16 or double UTF-16 surrogate pairs, but matching a complete grapheme cluster would be really complicated I think. There's certainly no built-in way to do it; JavaScript regular expressions are woefully inadequate for general Unicode patterns like that.
– Pointy
Nov 7 at 21:56
Please take a look at the marked question and if it didn't answer your problem edit accordingly.
– revo
Nov 7 at 22:40
@revo I don't think this question is a duplicate of the marked question. Firstly, it's incredibly unclear in what it asks (it starts by asking how to replace emojis in a string with their emoticon equivalents, and shifts to asking how to replace emojis with their unicode escape sequence equivalents). Secondly, the question specifically asks about emojis and neither the question nor any of the answers address other types of grapheme clusters besides emojis (most emojis are represented by a single codepoint). How do I appeal this decision?
– brainkim
Nov 8 at 0:26
Perl-style regular expressions use \X to match one cluster, but unfortunately that doesn't seem to have been lifted for JavaScript-flavor REs... If they support matching code points with given Unicode properties, you might be able to convert the EGC grammar in the Unicode spec to an RE, though.
– Shawn
Nov 8 at 6:04
\X matches all characters (regardless of number of bytes, e.g. a) along with grapheme clusters as one match. It works almost the same way as \P{M}\p{M}*, which is supported by ES6 and could be transpiled to ES5 (for this you could use this tool). The difference between the two is that \X has some rules for Hangul syllables (which you used in your examples): it doesn't break on them. So you have to match them separately. See this for more insights on Hangul.
– revo
Nov 8 at 11:36
edited Nov 8 at 21:32
asked Nov 7 at 21:52
brainkim
3072413
1 Answer
Full, easy-to-use integrated support: no. Approximations for various matching tasks: yes. From regex tutorial:
Matching a single grapheme, whether it's encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, Boost, Ruby 2.0, Java 9, and the Just Great Software applications: simply use \X. You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.
In .NET, Java 8 and prior, and Ruby 1.9 you can use \P{M}\p{M}*+ or (?>\P{M}\p{M}*) as a reasonably close substitute. To match any number of graphemes, use (?>\P{M}\p{M}*)+ as a substitute for \X+.
\X is the closest, and does not exist in any version through ES6. \P{M}\p{M}*+ approximates \X, but does not exist in that form: if you have Unicode property escapes (ES2018, with the u flag), natively or via transpilation, you can use /(\P{Mark})(\p{Mark}*)/gu.
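A minimal sketch of that approximation, assuming a runtime with Unicode property escapes. It uses a plain * because JavaScript has no possessive quantifiers, and the helper name firstApproxCluster is made up for illustration. The last two calls show where it falls short: conjoining Hangul jamo and emoji ZWJ sequences are not held together by combining marks, so the match stops early.

```javascript
// Approximation only: one non-mark code point plus any combining marks after it.
const approxCluster = /\P{Mark}\p{Mark}*/u;

// Hypothetical helper: returns the first approximate cluster, or null if none.
function firstApproxCluster(text) {
  const match = text.match(approxCluster);
  return match ? match[0] : null;
}

// Base letter + combining mark: matched as one unit.
console.log(firstApproxCluster("G\u0300abc") === "G\u0300");            // true
console.log(firstApproxCluster("\u0BA8\u0BBFbaz") === "\u0BA8\u0BBF"); // true
console.log(firstApproxCluster("aa") === "a");                          // true

// Decomposed Hangul (three jamo, none of them marks): only the first jamo.
console.log(firstApproxCluster("\u1112\u1161\u11ABbar") === "\u1112");  // true

// Emoji ZWJ sequence: U+200D is a format character, so the match ends before it.
console.log(
  firstApproxCluster("\u{1F486}\u200D\u2642\uFE0Ffoo") === "\u{1F486}"
); // true
```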
But even still, that isn't sufficient. <== Read that link for all the gory details.
A proposal to segment text has been put forward, but it's not yet adopted. If you're dedicated to Chrome, you can use its non-standard Intl.v8BreakIterator to break clusters and match manually.
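For reference, the segmentation proposal later shipped in many engines as Intl.Segmenter. A minimal sketch, assuming a runtime that provides it (the helper name firstGrapheme is hypothetical); it walks grapheme segments rather than using a regex, and the exact boundaries depend on the Unicode data bundled with the engine.

```javascript
// Hypothetical helper: returns the first grapheme cluster of text, or "" if empty.
// Assumes Intl.Segmenter is available in the runtime.
function firstGrapheme(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  // segment() returns an iterable of { segment, index, input } records.
  const first = segmenter.segment(text)[Symbol.iterator]().next();
  return first.done ? "" : first.value.segment;
}

console.log(firstGrapheme("G\u0300abc"));            // "G" + combining grave
console.log(firstGrapheme("\u1112\u1161\u11ABbar")); // decomposed Hangul syllable
console.log(firstGrapheme("\r\nrest") === "\r\n");   // CRLF kept together
```

To get every cluster instead of just the first, spread the result of segment() and map each record to its segment property.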
edited Nov 8 at 14:55
answered Nov 8 at 14:25
bishop
23k46087