Convert array of eight bytes to eight integers





I am working with the Xeon Phi Knights Landing. I need to do a gather operation from an array of doubles. The list of indices comes from an array of chars. The gather operations are either _mm512_i32gather_pd or _mm512_i64gather_pd. As I understand it, I either need to convert eight chars to eight 32-bit integers or eight chars to eight 64-bit integers. I have gone with the first choice for _mm512_i32gather_pd.



I have created two functions, get_index and get_index2, to convert eight chars to a __m256i. The assembly for get_index is simpler than for get_index2; see https://godbolt.org/z/lhg9fX. However, in my code get_index2 is significantly faster. Why is this? I am using ICC 18. Maybe there is a better solution than either of these two functions?



#include <x86intrin.h>
#include <inttypes.h>

__m256i get_index(char *index) {
    int64_t x = *(int64_t *)&index[0];
    // Shuffle mask: zero-extend eight bytes into eight 32-bit lanes
    // (vpshufb shuffles within each 128-bit lane; 0x80 zeroes a byte).
    const __m256i t3 = _mm256_setr_epi8(
        0, 0x80, 0x80, 0x80,
        1, 0x80, 0x80, 0x80,
        2, 0x80, 0x80, 0x80,
        3, 0x80, 0x80, 0x80,
        4, 0x80, 0x80, 0x80,
        5, 0x80, 0x80, 0x80,
        6, 0x80, 0x80, 0x80,
        7, 0x80, 0x80, 0x80);

    __m256i t2 = _mm256_set1_epi64x(x);
    __m256i t4 = _mm256_shuffle_epi8(t2, t3);
    return t4;
}

__m256i get_index2(char *index) {
    const __m256i t3 = _mm256_setr_epi8(
        0, 0x80, 0x80, 0x80,
        1, 0x80, 0x80, 0x80,
        2, 0x80, 0x80, 0x80,
        3, 0x80, 0x80, 0x80,
        4, 0x80, 0x80, 0x80,
        5, 0x80, 0x80, 0x80,
        6, 0x80, 0x80, 0x80,
        7, 0x80, 0x80, 0x80);
    __m128i t1 = _mm_loadl_epi64((__m128i*)index);
    __m256i t2 = _mm256_inserti128_si256(_mm256_castsi128_si256(t1), t1, 1);
    __m256i t4 = _mm256_shuffle_epi8(t2, t3);
    return t4;
}
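For reference, the transform both functions implement — zero-extending eight bytes into eight 32-bit lanes — can be sketched in portable scalar C. This is only an editorial sketch for checking results; the helper name get_index_scalar is hypothetical and not part of the original code.

```c
#include <stdint.h>

/* Portable scalar reference: zero-extend eight bytes to eight 32-bit ints.
   This is the value get_index/get_index2 place in each dword lane
   (the 0x80 shuffle mask zeroes the upper three bytes of every lane,
   so the extension is unsigned regardless of char's signedness). */
static void get_index_scalar(const char *index, int32_t out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)index[i];
}
```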
































    KNL has very slow 256-bit vpshufb ymm (12 uops, 23c latency, 12c throughput), and 128-bit XMM is slow, too. (MMX is fast :P). See Agner Fog's tables. Why can't you use vpmovzxbd or bq like a normal person? __m512i _mm512_cvtepu8_epi32(__m128i a) or _mm256_cvtepu8_epi32. Those are all single-uop with 2c throughput.

    – Peter Cordes
    Nov 24 '18 at 18:28













  • That doesn't explain your results, though. What loop did these functions inline into? Are you sure they didn't optimize differently somehow given different surrounding code? Otherwise IDK why a load + insert would be faster than a qword broadcast-load. Maybe some kind of front-end effect? Again we'd need to see the whole loop to guess about the front-end.

    – Peter Cordes
    Nov 24 '18 at 18:34











    @PeterCordes, thank you for pointing out _mm256_cvtepu8_epi32, that's exactly what I want, though in my code the result is no faster than get_index2. Maybe ICC converts get_index2 to vpmovzxbd in my code anyway. I did not think of this because I'm a bit rusty with vectorization. But now I get about a 4x improvement with manual vectorization compared to ICC auto-vectorization (with #pragma ivdep). I'm vectorizing stencil code.

    – Z boson
    Nov 26 '18 at 12:10
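The vpmovzxbd route suggested in the comments can be sketched as below. This is an editorial sketch, not the asker's code; the function names are hypothetical, and it assumes an AVX2-capable compiler and CPU (the target attribute lets GCC/Clang compile it without -mavx2).

```c
#include <immintrin.h>
#include <stdint.h>

/* One vpmovzxbd zero-extends eight bytes straight to eight 32-bit lanes,
   replacing the broadcast + vpshufb sequence with a single shuffle uop. */
__attribute__((target("avx2")))
static __m256i get_index_zx(const char *index) {
    __m128i bytes = _mm_loadl_epi64((const __m128i *)index); /* load 8 bytes */
    return _mm256_cvtepu8_epi32(bytes);                      /* vpmovzxbd */
}

/* Helper to spill the result for inspection (hypothetical). */
__attribute__((target("avx2")))
static void get_index_zx_to(const char *index, int32_t out[8]) {
    _mm256_storeu_si256((__m256i *)out, get_index_zx(index));
}
```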





















x86 avx2 xeon-phi avx512 knights-landing






asked Nov 24 '18 at 14:25









Z boson

21.1k782154







