D Unicode string literals: can't print specific Unicode character





I'm just trying to pick up D having come from C++. I'm sure it's something very basic, but I can't find any documentation to help me. I'm trying to print the character à, which is U+00E0. I am trying to assign this character to a variable and then use write() to output it to the console.



I'm told by this website that U+00E0 is encoded as 0xC3 0xA0 in UTF-8, 0x00E0 in UTF-16 and 0x000000E0 in UTF-32.



Note that, for everything below, I've also tried replacing string with char and wstring with wchar, and I've tried with and without the w or d suffixes after the wide-string literals.



These methods return the compiler error, "Invalid trailing code unit":



string str = "à";
wstring str = "à"w;
dstring str = "à"d;


These methods print a totally different character (Ò U+00D2):



string str = "\xE0";
string str = hexString!"E0";


And all these methods print what looks like ˧á (note á ≠ à!), which is UTF-16 0x2E7 0x00E1:



string str = "\xC3\xA0";
wstring str = "\u00E0"w;
dstring str = "\U000000E0"d;


Any ideas?










Tags: unicode, d, unicode-string, unicode-escapes






asked Nov 23 '18 at 17:28 by Joe C, edited Nov 23 '18 at 20:53 by 0xdd

  • What encoding are you saving the source file in and what encoding is your output terminal set to? And what operating system are you on? The language itself defines this stuff, but reading from source and writing to screen can introduce misunderstandings.

    – Adam D. Ruppe
    Nov 23 '18 at 17:31






  • The bottommost result looks like it thinks the encoding is IBM437.

    – Mr Lister
    Nov 23 '18 at 20:56











  • Thanks for responding! I'm on 64-bit Windows 10.0.17134. Trying to find or alter the source file encoding in Code::Blocks is a bit unclear. It seems to have previously been encoded in a WINDOWS encoding, but I've now switched it to UTF-32LE, recreated the project and issues continue. I find it quite likely that the issue is just in writing to the console, but this is essential to my needs. There seems to be a solution for C (docs.microsoft.com/en-us/windows/console/setconsoleoutputcp) - is there a D equivalent?

    – Joe C
    Nov 24 '18 at 14:32











  • You want the source encoded as UTF-8. The D compiler is a bit picky on that. Though if you can't do that, you can also stick to ASCII in the source and use \uXXXX escapes to write the other characters. For the output, that same function is the answer: remember, D can call C functions the same as C. So yeah, SetConsoleOutputCP(65001) before doing output should work. You can import core.sys.windows.windows; to make that function defined.

    – Adam D. Ruppe
    Nov 24 '18 at 21:50



















2 Answers
































I confirmed it works on my Windows box, so gonna type this up as an answer now.



In the source code, if you copy/paste the characters directly, make sure your editor is saving the file in UTF-8 encoding. The D compiler insists on it, so if it gives a compile error about a UTF thing, that's probably why. I have never used Code::Blocks, but an old answer on the web said Edit -> Encodings... it is a setting somewhere in the editor regardless.



Or, you can replace the characters in your source code with \uXXXX escapes in the strings. Do NOT use the hexString thing, that is for binary bytes, but your example of "\u00E0" is good, and will work for any type of string (not just wstring like in your example).
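For instance, here is a minimal sketch (not part of the original answer, just an illustration of that point): the same \u00E0 escape can appear in a string, wstring, or dstring literal, and the compiler emits the encoding that matches the literal's type.

import std.stdio;

void main() {
    string  s = "\u00E0";  // UTF-8: stored as the two bytes 0xC3 0xA0
    wstring w = "\u00E0"w; // UTF-16: stored as the single code unit 0x00E0
    dstring d = "\u00E0"d; // UTF-32: stored as the single code unit 0x000000E0
    writeln(s.length, " ", w.length, " ", d.length); // prints "2 1 1"
}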



Then, on the output side, it depends on your target because the program just outputs bytes, and it is up to the recipient program to interpret it correctly. Since you said you are on Windows, the key is to set the console code page to utf-8 so it knows what you are trying to do. Indeed, the same C function can be called from D too. Leading to this program:



import core.sys.windows.windows;
import std.stdio;

void main() {
    SetConsoleOutputCP(65001);
    writeln("Hi \u00E0");
}


printing it successfully. On older Windows versions, you might need to change your font to see the character too (as opposed to the generic box it shows because some fonts don't have all the characters), but on my Windows 10 box, it just worked with the default font.



BTW, technically the console code page is a shared setting (after the program runs and exits, you can still hit Properties on your console window and see the change reflected there), so you should perhaps set it back when your program exits. You could get the current value at startup with the get function ( https://docs.microsoft.com/en-us/windows/console/getconsoleoutputcp ), store it in a local variable, and set it back on exit: auto ccp = GetConsoleOutputCP(); scope(exit) SetConsoleOutputCP(ccp); SetConsoleOutputCP(65001); right at startup - the scope(exit) will run when the function exits, so doing it in main would be kinda convenient. Just add some error checking if you want.
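Put together as a small sketch (Windows-only, using the same Win32 calls mentioned above; not from the original answer):

import core.sys.windows.windows;
import std.stdio;

void main() {
    // Remember the current console code page and restore it when main exits.
    immutable oldCP = GetConsoleOutputCP();
    scope(exit) SetConsoleOutputCP(oldCP);

    SetConsoleOutputCP(65001); // 65001 is the UTF-8 code page
    writeln("Hi \u00E0");
}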



The Microsoft docs don't say anything about setting it back, so it probably doesn't actually matter, but I still wanna mention it just in case. The fact that it is shared and persists can also help in debugging - if the output still works after you comment the call out, it isn't because the code is unnecessary, it is just because the code page was set previously and hasn't been unset yet!



Note that running it from an IDE might not be exactly the same, because IDEs often pipe the output instead of running it right out to the Windows console. If that happens, lemme know and we can type up some stuff about that for future readers too. But you can also open your own copy of the console (run the program outside the IDE) and it should show correctly for you.






answered Nov 25 '18 at 0:51 by Adam D. Ruppe

  • Brilliant, works a charm! Just to note that the UTF-8 encoding "\xC3\xA0" works just as well as "\u00E0", which is the same character in UTF-16.

    – Joe C
    Nov 26 '18 at 14:24











  • Right, you can do it byte by byte, but the compiler will translate the various code points (strictly speaking, the \uxxxx is not UTF-16, it is the Unicode code point number) into the correct encoding for the given string. So using the \u stuff will make the right UTF-8 bytes in that context, or UTF-16 bytes in that context, etc.

    – Adam D. Ruppe
    Nov 26 '18 at 16:42
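To illustrate that last comment with a minimal sketch (not from the thread): \x spells out raw code units while \u names a code point, and for a string literal the compiler encodes the code point as UTF-8, so the two spellings of à produce identical data.

import std.stdio;

void main() {
    // In a string (UTF-8), the code point U+00E0 is the byte pair 0xC3 0xA0,
    // so the \u and \x spellings are the same literal:
    static assert("\u00E0" == "\xC3\xA0");

    // In a wstring (UTF-16) the same code point is a single code unit:
    static assert("\u00E0"w.length == 1);

    writeln("\u00E0"); // prints à once the console is in a UTF-8 code page
}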

































D source code needs to be encoded as UTF-8.
My guess is that you're putting a UTF-16 character into the UTF-8 source file.



E.g.



import std.stdio;
void main() {
    writeln(cast(char)0xC3, cast(char)0xA0);
}


This will output the character you seek, encoded as UTF-8.



Which you can then hard code like so:



import std.stdio;
void main() {
    string str = "à";
    writeln(str);
}





answered Nov 23 '18 at 17:36 by Richard Andrew Cattermole

  • Thanks for having a go, but sadly these have the same problems as the methods I already tried...

    – Joe C
    Nov 24 '18 at 14:33











