c++ openmp false-sharing on aligned array example

I would like to see the effect of false sharing. To do so, I designed a small experiment, but I got unexpected results.

I have an array containing 100M integers. Consider it as an m x n matrix. One thread changes the odd-indexed rows and the other thread changes the even-indexed rows.

Experiment A: The number of columns is 16, so each row is 64 bytes, exactly my cache-line size. Since each thread processes exactly one cache line at a time, there should be no false sharing. Therefore, I expect around 100% speedup.

Experiment B: The number of columns is 8, so each thread changes 32 bytes at a time, which is half a cache line. For example, when thread 1 processes row 33, the line must be transferred from thread 0, because thread 0 has already processed row 32, which is in the same cache line (or vice versa; the order does not matter). Because of this communication, the speedup should be low.
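
To make the layout concrete, here is a small sketch (my own illustration, not part of the experiment) that prints the cache line in which each row starts, assuming 4-byte ints, a 64-byte line size, and the row-major layout used by the code below:

#include <cstdio>
#include <initializer_list>

int main() {
    const long line_bytes = 64;                     // assumed cache-line size
    for (int col_count : {16, 8}) {
        std::printf("col_count = %d:\n", col_count);
        for (int row = 0; row < 4; row++) {
            long first_byte = 1L * row * col_count * sizeof(int);
            // col_count = 16: rows map to lines 0,1,2,3 (one line per row)
            // col_count = 8:  rows map to lines 0,0,1,1 (two rows per line)
            std::printf("  row %d starts in cache line %ld\n",
                        row, first_byte / line_bytes);
        }
    }
    return 0;
}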



#include <iostream>
#include <cstdlib>   // atoi, aligned_alloc, free
#include <omp.h>

using namespace std;

int main(int argc, char** argv) {

    if(argc != 3) {
        cout << "Usage: " << argv[0] << " <iteration> <col_count>" << endl;
        return 1;
    }

    int thread_count = omp_get_max_threads();
    int iteration = atoi(argv[1]);
    int col_count = atoi(argv[2]);
    int arr_size = 100000000;

    // 64-byte alignment, so every 16-int row starts on a cache-line boundary
    int* A = (int*) aligned_alloc(16 * sizeof(int), arr_size * sizeof(int));

    int row_count = arr_size / col_count;
    int row_count_per_thread = row_count / thread_count;

    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();

        long long total = 1ll * iteration * row_count_per_thread * col_count;
        printf("%lld\n", total);

        for(int t = 0; t < iteration; t++) {

            // thread k handles rows k, k + thread_count, k + 2*thread_count, ...
            for(int i = 0; i < row_count_per_thread; i++) {

                int start = (i * thread_count + thread_id) * col_count;
                for(int j = start; j < start + col_count; j++) {

                    if(A[j] % 2 == 0)
                        A[j] += 3;
                    else
                        A[j] += 1;
                }
            }
        }
    }

    free(A);
    return 0;
}


I run this code with different configurations in the following way:



time taskset -c 0-1 ./run 100 16



Here are the results for 100 iterations:



Thread      Column      Optimization        Time (secs)
_______________________________________________________
1           16          O3                      7.6
1           8           O3                      7.7
2           16          O3                      7.7
2           8           O3                      7.7

1           16          O0                     35.9
1           8           O0                     34.3
2           16          O0                     19.3
2           8           O0                     18.2


As you can see, although O3 optimization gives the best results, they are very strange, because increasing the number of threads does not give any speedup. To me, the O0 results are more interpretable.



The real question: Look at the last 2 lines. In both cases I got almost 100% speedup; however, I expected the execution time of experiment B to be much higher, since it has a false-sharing issue. What is wrong with my experiment or my understanding?



I compiled it with
g++ -std=c++11 -Wall -fopenmp -O0 -o run -Iinc $(SOURCE)
and
g++ -std=c++11 -Wall -fopenmp -O3 -o run -Iinc $(SOURCE)



Let me know if my problem is not clear or needs more detail.





Update: Specs:



MemTotal:        8080796 kB
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 71
Model name: Intel(R) Core(TM) i7-5700HQ CPU @ 2.70GHz
Stepping: 1
CPU MHz: 2622.241
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 5387.47
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7


Update 2: I have tried different iteration_count and arr_size parameters so that the array fits in the L2 or even L1 cache, while keeping the total number of element updates constant. The results are still the same.
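
For reference, a quick sketch (my own arithmetic, using the cache sizes from the specs above) of how many ints fit in each cache level when picking arr_size:

#include <cstdio>

int main() {
    // Sizes taken from the lscpu output above: 32K L1d, 256K L2, 6144K L3
    const long sizes[]  = {32L * 1024, 256L * 1024, 6144L * 1024};
    const char* names[] = {"L1d", "L2", "L3"};
    for (int i = 0; i < 3; i++)
        std::printf("%-3s holds at most %7ld ints\n",
                    names[i], sizes[i] / (long)sizeof(int));
    return 0;   // prints 8192, 65536, and 1572864 ints respectively
}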



Thank you.










Tags: c++ multithreading caching memory openmp

Asked Nov 13 '18 at 11:31 by Seljuk Gülcan; edited Nov 14 '18 at 12:57

Comments:

  • Why are you testing with optimizations disabled? That means there is a lot of overhead that masks the false-sharing latencies... – Max Langhof, Nov 13 '18 at 11:39

  • Please repeat with optimization - any performance discussion without optimization is meaningless. Combing through 800 MB of data like this should never take more than 0.1 s. Also please upgrade your code to a Minimal, Complete, and Verifiable example to help with a practical answer. – Zulan, Nov 13 '18 at 12:04

  • @MaxLanghof Thank you for the response. I edited the question, but I got no speedup when I increased the number of threads with O3 optimization. Could you check the edited question, please? I added a simpler version of the code. – Seljuk Gülcan, Nov 13 '18 at 15:18

  • Have you watched this video? It seems an exact code copy. And the answers are in the video. – Ripi2, Nov 13 '18 at 18:01

  • @Ripi2 I checked the video after you mentioned it. Thank you, it is a very good resource and I learnt many things from it. Although the code is not the same, the concepts are similar. However, what I observe here is the opposite of what should happen according to the video. I am asking why that is. – Seljuk Gülcan, Nov 14 '18 at 12:53

1 Answer

Answer (score 6, bounty +50), answered Nov 23 '18 at 11:44 by Kit., edited Nov 23 '18 at 13:55:

Your -O3 timing seems to be consistent with the single-channel memory access speed of your processor. You could probably get up to 2x better speed with a dual-channel memory configuration, but that is unlikely to introduce any other difference in your results. Bear in mind that on your processor there is a single L3 cache shared between all cores, so any false sharing will most likely be resolved at the L3 cache level and will not put additional load on the external memory bus.
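
(A quick sanity check of that claim, my own arithmetic: each iteration streams 100M ints x 4 bytes = 400 MB of reads plus 400 MB of writes, so 100 iterations move about 80 GB; 80 GB / 7.6 s is roughly 10.5 GB/s, close to the ~12.8 GB/s peak of a single DDR3-1600 channel, assuming that is what this machine uses.)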



There are more problems with your code (beyond just "slow" memory access) that may prevent you from seeing the effects of false sharing.



First, it's quite unlikely that both your threads will compete for exactly the same cache line, given the timing unpredictability involved in thread scheduling.



Second, even if they do conflict, the conflict will be temporary, because any factor that leads to an asymmetric slow-down will cause the "slower" thread to fall behind until it is out of the conflicting memory range.



Third, if they happen to run on two hardware threads of the same core, they will access the very same caches, and there will be no cache-line conflicts.



To "fix" all of this, you need more threads (or threads bound to particular cores) and a much tighter memory area for possible conflicts. The "best" results will be if your threads compete for just one cache line of memory.






  • Thanks. In all of my experiments, each thread is bound to a different physical core, so I think your third point is not an issue. I'll run experiments with different numbers of cores and varying memory areas on a different architecture where the number of cores is much higher. I'll update when I get results. – Seljuk Gülcan, Nov 23 '18 at 12:14










