What encoding are character / string literals stored in? (Or how to find a literal character in a string from...




















As we know, different encodings map different byte representations to the same characters. Using setlocale we can specify the encoding of strings that are read from input, but does this apply to string literals as well? I'd find this surprising, since these are fixed at compile time!

This matters for tasks as simple as determining whether a string read from input contains a specific character. When reading strings from input it seems sensible to set the locale to the user's locale (setlocale(LC_ALL, "");) so that the string is read and processed correctly. But when we compare this string with a character literal, won't problems arise due to mismatched encodings?

In other words: the following snippet seems to work for me. But doesn't it work only by coincidence, for example because the source code happened to be saved in the same encoding that is used on the machine at runtime?



    #include <stdio.h>
    #include <wchar.h>
    #include <stdlib.h>
    #include <locale.h>

    int main()
    {
        setlocale(LC_ALL, "");

        // Read line and convert it to wide string so that wcschr can be used
        // So many lines! And that's even though I'm omitting the necessary
        // error checking for brevity. Ah I'm also omitting free's
        char *s = NULL; size_t n = 0;
        getline(&s, &n, stdin);
        mbstate_t st = {0}; const char* cs = s;
        size_t wn = mbsrtowcs(NULL, &cs, 0, &st);
        wchar_t *ws = malloc((wn+1) * sizeof(wchar_t));
        st = (mbstate_t){0};
        mbsrtowcs(ws, &cs, (wn+1), &st);

        int contains_guitar = (wcschr(ws, L'🎸') != NULL);
        if(contains_guitar)
            printf("Let's rock!\n");
        else
            printf("Let's not.\n");
        return 0;
    }


How to do this correctly?










      c string encoding locale string-literals
















asked Nov 24 '18 at 12:11
gaazkam
























          2 Answers
> Using setlocale we can specify the encoding of strings that are read from input, but does this apply to string literals as well?

No. String literals use the execution character set, which is fixed by your compiler at compile time.

The execution character set does not have to be the same as the source character set (the character set used in the source code). The C compiler is responsible for translating between the two, and should have options for choosing them; GCC, for example, has -finput-charset= and -fexec-charset=. The default depends on the compiler, but on Linux and most current POSIXy systems it is usually UTF-8.




> The following snippet seems to work for me. But doesn't it work only because of coincidence?

The example works because the character set of your locale, the source character set, and the execution character set used when the binary was built all happen to be UTF-8.




> How to do this correctly?

Two options. One is to use wide characters and wide string literals. The other is to use UTF-8 everywhere.

For wide input and output, see e.g. this example in another answer here.

Do note that getwline() and getwdelim() are not in POSIX.1 or in the C standard proper; they are specified in ISO/IEC TR 24731-2 (the dynamic allocation extensions), not in C11 Annex K. This means they are optional, and as of this writing, not widely available at all. Thus, a custom implementation around fgetwc() is recommended instead. (One based on fgetws(), wcslen(), and/or wcscspn() will not be able to handle embedded nuls, L'\0', correctly.)



In a typical wide I/O program, you only need mbstowcs() to convert command-line arguments and environment variables to wide strings.



Using UTF-8 everywhere is also a perfectly valid practical approach, at least if it is well documented, so that users know the program reads and writes UTF-8 strings, and developers know to make sure their C compiler uses UTF-8 as the execution character set when building the binaries.

Your program can even use e.g.

    /* Needs <locale.h>, <langinfo.h>, <string.h>, and <stdio.h>. */
    if (!setlocale(LC_ALL, ""))
        fprintf(stderr, "Warning: Your C library does not support your current locale.\n");
    if (strcmp("UTF-8", nl_langinfo(CODESET)))
        fprintf(stderr, "Warning: Your locale does not use the UTF-8 character set.\n");

to verify the current locale uses UTF-8.

I have used both approaches, depending on the circumstances. It is difficult to say which one is more portable in practice because, as usual, both work just fine on non-Windows OSes.

















answered Nov 24 '18 at 13:13
Nominal Animal













            • If a "proper" example of getwline() and counting the number of Unicode U+1F3B8 🎸 glyphs would help, let me know in a comment.

              – Nominal Animal
              Nov 24 '18 at 13:15
































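If you're willing to assume UTF-8,

    strstr(s, "🎸")

Or:

    strstr(s, u8"🎸")

The latter does not depend on the execution character set (the compiler still has to decode the source file correctly), but requires a C11 compiler. If you want the best of both and can sacrifice readability, spell out the UTF-8 bytes as octal escapes:

    strstr(s, "\360\237\216\270")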
















answered Nov 24 '18 at 14:12
R..





























