What encoding are character / string literals stored in? (Or how to find a literal character in a string from...
As we know, different encodings map different byte representations to the same characters. Using setlocale we can specify the encoding of strings that are read from input, but does this apply to string literals as well? I'd find this surprising, since these are fixed at compile time!
This matters for tasks as simple as determining whether a string read from input contains a specific character. When reading strings from input, it seems sensible to set the locale to the user's locale (setlocale(LC_ALL, "");) so that the string is read and processed correctly. But when we compare this string with a character literal, won't problems arise due to mismatched encodings?
In other words: the following snippet seems to work for me. But doesn't it work only by coincidence, for example because the source code happened to be saved in the same encoding that is used on the machine at runtime?
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "");
    // Read line and convert it to wide string so that wcschr can be used
    // So many lines! And that's even though I'm omitting the necessary
    // error checking for brevity. Ah I'm also omitting free's
    char *s = NULL; size_t n = 0;
    getline(&s, &n, stdin);
    mbstate_t st = {0}; const char *cs = s;
    size_t wn = mbsrtowcs(NULL, &cs, 0, &st);
    wchar_t *ws = malloc((wn+1) * sizeof(wchar_t));
    st = (mbstate_t){0};
    mbsrtowcs(ws, &cs, (wn+1), &st);
    int contains_guitar = (wcschr(ws, L'🎸') != NULL);
    if(contains_guitar)
        printf("Let's rock!\n");
    else
        printf("Let's not.\n");
    return 0;
}
How to do this correctly?
c string encoding locale string-literals
asked Nov 24 '18 at 12:11 by gaazkam
2 Answers
Using setlocale we can specify the encoding of strings that are read from input, but does this apply to string literals as well?
No. String literals use the execution character set, which is fixed by your compiler at compile time.
The execution character set does not have to be the same as the source character set (the character set used in the source code). The C compiler is responsible for translating between the two, and should have options for choosing them. The default depends on the compiler, but on Linux and most current POSIXy systems it is usually UTF-8.
The following snippet seems to work for me. But doesn't it work only because of coincidence?
The example works because the character set of your locale, the source character set, and the execution character set used when the binary was built all happen to be UTF-8.
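One way to see which bytes the compiler actually stored for a literal is to dump them. The following is a minimal sketch added for illustration, not code from the answer; the helper name hex_bytes is made up here. It takes a const void * so that it accepts both plain and u8 literals.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write the bytes of the NUL-terminated string s into
   out as uppercase hex pairs. out must hold 2 * (length of s) + 1 chars. */
static void hex_bytes(const void *s, char *out)
{
    assert(s != NULL && out != NULL);
    const unsigned char *p = s;
    for (; *p != '\0'; p++, out += 2)
        sprintf(out, "%02X", (unsigned)*p);   /* two hex digits per byte */
    *out = '\0';
}
```

Calling hex_bytes(u8"\U0001F3B8", buf) should give F09F8EB8 with any C11 compiler, because a u8 literal is always UTF-8 and the universal character name keeps the source file pure ASCII. Calling it on the plain literal "\U0001F3B8" shows whatever the execution character set produces; with GCC, for example, that can be changed with -fexec-charset, and the source character set with -finput-charset.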
How to do this correctly?
Two options. One is to use wide characters and wide string literals. The other is to use UTF-8 everywhere.
For wide input and output, see e.g. this example in another answer here.
Do note that getwline() and getwdelim() are not in POSIX.1, but in ISO/IEC TR 24731-2, the dynamic allocation extensions to the C library. This means they are optional and, as of this writing, not widely available at all. Thus, a custom implementation around fgetwc() is recommended instead. (One based on fgetws(), wcslen(), and/or wcscspn() will not be able to handle embedded nuls, L'\0', correctly.)
In a typical wide I/O program, you only need mbstowcs() to convert command-line arguments and environment variables to wide strings.
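A custom line reader built on fgetwc() could look roughly like the sketch below. It is an illustration under stated assumptions, not the answerer's code: read_wide_line and count_wchar are hypothetical names, the interface is only loosely modelled on getline(), and the caller is expected to have called setlocale(LC_ALL, "") before reading non-ASCII input.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper (not a standard function): read one line of wide
   characters from in, up to and including L'\n', into a malloc'd buffer
   that grows as needed. Returns the number of wide characters stored
   (embedded L'\0' included), or (size_t)-1 on end of input with nothing
   read, on a read/conversion error, or on allocation failure. On success
   the buffer is L'\0'-terminated. */
static size_t read_wide_line(wchar_t **line, size_t *cap, FILE *in)
{
    assert(line != NULL && cap != NULL && in != NULL);
    size_t len = 0;
    wint_t wc;

    if (*line == NULL)
        *cap = 0;

    while ((wc = fgetwc(in)) != WEOF) {
        if (len + 2 > *cap) {              /* room for wc and the terminator */
            size_t new_cap = *cap ? 2 * *cap : 64;
            wchar_t *tmp = realloc(*line, new_cap * sizeof **line);
            if (tmp == NULL)
                return (size_t)-1;
            *line = tmp;
            *cap = new_cap;
        }
        (*line)[len++] = (wchar_t)wc;
        if (wc == L'\n')
            break;
    }
    if (len == 0 || ferror(in))
        return (size_t)-1;
    (*line)[len] = L'\0';
    return len;
}

/* Count occurrences of target in the first len elements of s. Using the
   returned length instead of wcslen() keeps embedded L'\0' from cutting
   the scan short. */
static size_t count_wchar(const wchar_t *s, size_t len, wchar_t target)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (s[i] == target)
            n++;
    return n;
}
```

With these, counting guitars in a line would be count_wchar(line, n, L'\U0001F3B8'), where the universal character name avoids depending on the source encoding. This assumes wchar_t is wide enough to hold U+1F3B8 in one element, which is the case on Linux but not on platforms with a 16-bit wchar_t.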
Using UTF-8 everywhere is also a perfectly valid practical approach, at least if it is well documented, so that users know the program inputs and outputs UTF-8 strings, and developers know to ensure their C compiler uses UTF-8 as the execution character set when compiling those binaries.
Your program can even use e.g.
if (!setlocale(LC_ALL, ""))
    fprintf(stderr, "Warning: Your C library does not support your current locale.\n");
if (strcmp("UTF-8", nl_langinfo(CODESET)))
    fprintf(stderr, "Warning: Your locale does not use the UTF-8 character set.\n");
to verify that the current locale uses UTF-8. (This needs <locale.h>, <langinfo.h>, <string.h>, and <stdio.h>; nl_langinfo() is POSIX, not ISO C.)
I have used both approaches, depending on the circumstances. It is difficult to say which one is more portable in practice because, as usual, both work just fine on non-Windows OSes.
answered Nov 24 '18 at 13:13 by Nominal Animal
If a "proper" example of getwline() and counting the number of Unicode U+1F3B8 🎸 glyphs would help, let me know in a comment. – Nominal Animal, Nov 24 '18 at 13:15
If you're willing to assume UTF-8,
strstr(s,"🎸")
Or:
strstr(s,u8"🎸")
The latter avoids assuming that the execution character set is UTF-8, but requires a C11 compiler. If you want the best of both and can sacrifice readability:
strstr(s,"\360\237\216\270")
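Wrapped into a function, the byte-escape variant might look like the sketch below. The names GUITAR_UTF8 and contains_guitar are illustrative, and hex escapes are used in place of the octal ones above; both spell the same four bytes (F0 9F 8E B8). The input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <string.h>

/* U+1F3B8 GUITAR encoded as UTF-8, spelled with hex escapes so that neither
   the source encoding nor the execution character set affects the bytes. */
static const char GUITAR_UTF8[] = "\xF0\x9F\x8E\xB8";

/* Returns 1 if the UTF-8 string s contains U+1F3B8, 0 otherwise.
   Assumes s is valid UTF-8. */
static int contains_guitar(const char *s)
{
    assert(s != NULL);
    return strstr(s, GUITAR_UTF8) != NULL;
}
```

A plain byte-wise strstr() is enough here because UTF-8 is self-synchronizing: in valid UTF-8, the bytes of one character never match starting in the middle of another character's encoding.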
answered Nov 24 '18 at 14:12 by R..