Convert array of eight bytes to eight integers





I am working with the Xeon Phi Knights Landing. I need to do a gather operation from an array of doubles. The list of indices comes from an array of chars. The gather operations are either _mm512_i32gather_pd or _mm512_i64gather_pd. As I understand it, I either need to convert eight chars to eight 32-bit integers or eight chars to eight 64-bit integers. I have gone with the first choice for _mm512_i32gather_pd.



I have created two functions, get_index and get_index2, to convert eight chars to a __m256i. The assembly for get_index is simpler than for get_index2; see https://godbolt.org/z/lhg9fX. However, in my code get_index2 is significantly faster. Why is this? I am using ICC 18. Maybe there is a better solution than either of these two functions?



#include <x86intrin.h>
#include <inttypes.h>

__m256i get_index(char *index) {
    int64_t x = *(int64_t *)&index[0];
    // Shuffle mask: zero-extend eight bytes into eight 32-bit lanes
    // (vpshufb shuffles within each 128-bit lane; 0x80 zeroes a byte).
    const __m256i t3 = _mm256_setr_epi8(
        0, 0x80, 0x80, 0x80,
        1, 0x80, 0x80, 0x80,
        2, 0x80, 0x80, 0x80,
        3, 0x80, 0x80, 0x80,
        4, 0x80, 0x80, 0x80,
        5, 0x80, 0x80, 0x80,
        6, 0x80, 0x80, 0x80,
        7, 0x80, 0x80, 0x80);

    __m256i t2 = _mm256_set1_epi64x(x);
    __m256i t4 = _mm256_shuffle_epi8(t2, t3);
    return t4;
}

__m256i get_index2(char *index) {
    const __m256i t3 = _mm256_setr_epi8(
        0, 0x80, 0x80, 0x80,
        1, 0x80, 0x80, 0x80,
        2, 0x80, 0x80, 0x80,
        3, 0x80, 0x80, 0x80,
        4, 0x80, 0x80, 0x80,
        5, 0x80, 0x80, 0x80,
        6, 0x80, 0x80, 0x80,
        7, 0x80, 0x80, 0x80);
    __m128i t1 = _mm_loadl_epi64((__m128i*)index);
    __m256i t2 = _mm256_inserti128_si256(_mm256_castsi128_si256(t1), t1, 1);
    __m256i t4 = _mm256_shuffle_epi8(t2, t3);
    return t4;
}
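For reference, the transform both functions implement — zero-extending eight bytes into eight 32-bit lanes — can be sketched in portable scalar C. This is only an editorial sketch for checking results; the helper name get_index_scalar is hypothetical and not part of the original code.

```c
#include <stdint.h>

/* Portable scalar reference: zero-extend eight bytes to eight 32-bit ints.
   This is the value get_index/get_index2 place in each dword lane
   (the 0x80 shuffle mask zeroes the upper three bytes of every lane,
   so the extension is unsigned regardless of char's signedness). */
static void get_index_scalar(const char *index, int32_t out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)index[i];
}
```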
































    KNL has very slow 256-bit vpshufb ymm (12 uops, 23c latency, 12c throughput), and 128-bit XMM is slow, too. (MMX is fast :P). See Agner Fog's tables. Why can't you use vpmovzxbd or bq like a normal person? __m512i _mm512_cvtepu8_epi32(__m128i a) or _mm256_cvtepu8_epi32. Those are all single-uop with 2c throughput.

    – Peter Cordes
    Nov 24 '18 at 18:28













  • That doesn't explain your results, though. What loop did these functions inline into? Are you sure they didn't optimize differently somehow given different surrounding code? Otherwise IDK why a load + insert would be faster than a qword broadcast-load. Maybe some kind of front-end effect? Again we'd need to see the whole loop to guess about the front-end.

    – Peter Cordes
    Nov 24 '18 at 18:34











    @PeterCordes, thank you for pointing out _mm256_cvtepu8_epi32, that's exactly what I want, though in my code the result is no faster than get_index2. Maybe ICC converts get_index2 to vpmovzxbd in my code anyway. I did not think of this because I'm a bit rusty with vectorization. But now I get about a 4x improvement with manual vectorization compared to ICC auto-vectorization (with #pragma ivdep). I'm vectorizing stencil code.

    – Z boson
    Nov 26 '18 at 12:10
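The vpmovzxbd route suggested in the comments can be sketched as below. This is an editorial sketch, not the asker's code; the function names are hypothetical, and it assumes an AVX2-capable compiler and CPU (the target attribute lets GCC/Clang compile it without -mavx2).

```c
#include <immintrin.h>
#include <stdint.h>

/* One vpmovzxbd zero-extends eight bytes straight to eight 32-bit lanes,
   replacing the broadcast + vpshufb sequence with a single shuffle uop. */
__attribute__((target("avx2")))
static __m256i get_index_zx(const char *index) {
    __m128i bytes = _mm_loadl_epi64((const __m128i *)index); /* load 8 bytes */
    return _mm256_cvtepu8_epi32(bytes);                      /* vpmovzxbd */
}

/* Helper to spill the result for inspection (hypothetical). */
__attribute__((target("avx2")))
static void get_index_zx_to(const char *index, int32_t out[8]) {
    _mm256_storeu_si256((__m256i *)out, get_index_zx(index));
}
```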





















x86 avx2 xeon-phi avx512 knights-landing






asked Nov 24 '18 at 14:25









Z boson

21.1k782154







