What encoding are character / string literals stored in? (Or how to find a literal character in a string from...


As we know, different encodings map the same characters to different byte representations. Using setlocale we can specify the encoding of strings that are read from input, but does this apply to string literals as well? I'd find this surprising, since literals are fixed at compile time!

This matters for tasks as simple as determining whether a string read from input contains a specific character. When reading strings from input it seems sensible to set the locale to the user's locale (setlocale(LC_ALL, "");) so that the string is read and processed correctly. But when we compare this string with a character literal, won't problems arise due to mismatched encodings?

In other words: the following snippet seems to work for me. But doesn't it work only by coincidence? For example, because the source code happened to be saved in the same encoding that the machine uses at runtime?

#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>

int main()
{
        setlocale(LC_ALL, "");

        // Read line and convert it to wide string so that wcschr can be used
        // So many lines! And that's even though I'm omitting the necessary
        // error checking for brevity. Ah I'm also omitting free's
        char *s = NULL; size_t n = 0;
        getline(&s, &n, stdin);
        mbstate_t st = {0}; const char* cs = s;
        size_t wn = mbsrtowcs(NULL, &cs, 0, &st);
        wchar_t *ws = malloc((wn+1) * sizeof(wchar_t));
        st = (mbstate_t){0};
        mbsrtowcs(ws, &cs, (wn+1), &st);

        int contains_guitar = (wcschr(ws, L'🎸') != NULL);
        if(contains_guitar)
                printf("Let's rock!\n");
        else
                printf("Let's not.\n");
        return 0;
}

How to do this correctly?

asked Nov 24 '18 at 12:11

gaazkam

2,3481039


c string encoding locale string-literals


2 Answers

Using setlocale we can specify the encoding of strings that are read from input, but does this apply to string literals as well?

No. String literals use the execution character set, which is defined by your compiler at compile time.

The execution character set does not have to be the same as the source character set, which is the character set used in the source code. The C compiler is responsible for translating between the two, and should have options for choosing them. The default depends on the compiler, but on Linux and most current POSIX-like systems it is usually UTF-8.

The following snippet seems to work for me. But doesn't it work only because of coincidence?

The example works because the character set of your locale, the source character set, and the execution character set used when the binary was constructed, all happen to be UTF-8.
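To see that a literal's bytes are fixed when the program is compiled, one can compare a narrow literal against the known UTF-8 bytes of U+1F3B8. The following is only a sketch: the function names are hypothetical, and the literal is written as a universal character name so that the source file's encoding does not matter. With GCC, the options -finput-charset and -fexec-charset are the ones that select the two character sets.

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the compiler encoded the narrow
   literal for U+1F3B8 as the four UTF-8 bytes F0 9F 8E B8, else 0. */
int narrow_literal_is_utf8(void)
{
    /* Stored in the execution character set chosen at compile time. */
    static const char literal[] = "\U0001F3B8";
    static const unsigned char utf8[] = { 0xF0, 0x9F, 0x8E, 0xB8 };

    return sizeof literal - 1 == sizeof utf8 &&
           memcmp(literal, utf8, sizeof utf8) == 0;
}

/* Hypothetical helper: returns 1 if the result above is the same before
   and after changing the locale. Note that this switches the global
   locale to "C" as a side effect. */
int literal_ignores_locale(void)
{
    int before = narrow_literal_is_utf8();
    setlocale(LC_ALL, "C");
    int after = narrow_literal_is_utf8();
    return before == after;
}
```

Here narrow_literal_is_utf8() returns 1 only when the binary was built with UTF-8 as the execution character set, while literal_ignores_locale() returns 1 regardless, because setlocale() cannot change bytes that are already in the binary.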

How to do this correctly?

Two options. One is to use wide characters and wide string literals. The other is to use UTF-8 everywhere.

For wide input and output, see e.g. this example in another answer here.

Do note that getwline() and getwdelim() are not in POSIX.1, nor in the C standard proper; they are specified in ISO/IEC TR 24731-2 (the dynamic allocation functions extension). This means they are optional, and as of this writing, not widely available at all. Thus, a custom implementation around fgetwc() is recommended instead. (One based on fgetws(), wcslen(), and/or wcscspn() will not be able to handle embedded nuls, L'\0', correctly.)
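A custom line reader built on fgetwc() could look roughly like the sketch below. The name read_wide_line and its interface are assumptions made for illustration, loosely modelled on getline(); it is not a standard function. It returns the length explicitly, so embedded L'\0' characters are preserved.

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper, not a standard function: reads one line of wide
   characters (including the trailing L'\n', if present) into *line,
   growing the buffer with realloc(). Returns the number of wide
   characters stored, or -1 if nothing could be read or on failure. */
long read_wide_line(wchar_t **line, size_t *cap, FILE *in)
{
    size_t len = 0;
    wint_t wc;

    while ((wc = fgetwc(in)) != WEOF) {
        /* Keep room for this character and the terminating L'\0'. */
        if (len + 2 > *cap) {
            size_t new_cap = *cap ? *cap * 2 : 64;
            wchar_t *tmp = realloc(*line, new_cap * sizeof *tmp);
            if (!tmp)
                return -1;
            *line = tmp;
            *cap = new_cap;
        }
        (*line)[len++] = (wchar_t)wc;
        if (wc == L'\n')
            break;
    }
    if (len == 0)
        return -1;
    (*line)[len] = L'\0';
    return (long)len;
}
```

The caller starts with a NULL buffer and a capacity of zero, reuses both across calls, and frees the buffer at the end. For non-ASCII input, setlocale(LC_ALL, "") should be called before the first read so that fgetwc() decodes the stream using the user's locale.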

In a typical wide I/O program, you only need mbstowcs() to convert command-line arguments and environment variables to wide strings.
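As a rough illustration of that conversion, the sketch below wraps mbstowcs() in a helper. The name to_wide is hypothetical. It relies on the fact that a multibyte string never yields more wide characters than it has bytes, so a buffer of strlen(s) + 1 elements is always large enough.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts a multibyte string (in the encoding of
   the current locale) to a newly allocated wide string. Returns NULL on
   an invalid sequence or allocation failure. The caller frees the result. */
wchar_t *to_wide(const char *s)
{
    size_t max = strlen(s) + 1;          /* upper bound, incl. terminator */
    wchar_t *ws = malloc(max * sizeof *ws);
    if (!ws)
        return NULL;

    size_t n = mbstowcs(ws, s, max);
    if (n == (size_t)-1) {               /* invalid multibyte sequence */
        free(ws);
        return NULL;
    }
    return ws;                           /* n < max, so it is terminated */
}
```

Calling this on argv[1] after setlocale(LC_ALL, "") would give a wide string that can be searched with wcschr() or printed with wide output functions.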

Using UTF-8 everywhere is also a perfectly valid practical approach, at least if it is well documented, so that users know the program inputs and outputs UTF-8 strings, and developers know to ensure their C compiler uses UTF-8 as the execution character set when compiling those binaries.

Your program can even use e.g.

    if (!setlocale(LC_ALL, ""))
        fprintf(stderr, "Warning: Your C library does not support your current locale.\n");
    if (strcmp("UTF-8", nl_langinfo(CODESET)))
        fprintf(stderr, "Warning: Your locale does not use the UTF-8 character set.\n");

to verify the current locale uses UTF-8.

I have used both approaches, depending on the circumstances. It is difficult to say which one is more portable in practice, because as usual, both work just fine on non-Windows OSes without issues.

answered Nov 24 '18 at 13:13

Nominal Animal

30.5k33463

If a "proper" example of getwline() and counting the number of Unicode U+1F3B8 🎸 glyphs would help, let me know in a comment.

– Nominal Animal
Nov 24 '18 at 13:15


If you're willing to assume UTF-8,

strstr(s,"🎸")

Or:

strstr(s,u8"🎸")

The latter avoids some assumptions but requires a C11 compiler. If you want the best of both and can sacrifice readability:

strstr(s,"\360\237\216\270")
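Wrapped into small functions, the byte-level search might look like the sketch below. The names contains_guitar and count_guitars are made up for illustration, and the input is assumed to be valid UTF-8. A plain byte search is safe in that case because the encoding of one character never appears inside the encoding of another.

```c
#include <stddef.h>
#include <string.h>

/* UTF-8 encoding of U+1F3B8, spelled as octal escapes so the bytes do
   not depend on the source or execution character set. */
static const char GUITAR_UTF8[] = "\360\237\216\270";

/* Hypothetical helper: returns 1 if the UTF-8 string s contains
   U+1F3B8, and 0 otherwise. */
int contains_guitar(const char *s)
{
    return strstr(s, GUITAR_UTF8) != NULL;
}

/* Hypothetical helper: counts the occurrences of U+1F3B8 in the UTF-8
   string s. */
size_t count_guitars(const char *s)
{
    const size_t step = sizeof GUITAR_UTF8 - 1;   /* 4 bytes */
    size_t count = 0;

    while ((s = strstr(s, GUITAR_UTF8)) != NULL) {
        count++;
        s += step;                                /* skip past this match */
    }
    return count;
}
```

No locale setup or wide-character conversion is needed here, which is the main appeal of the UTF-8 approach.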

answered Nov 24 '18 at 14:12

R..

158k27263566
