C++ OpenMP false-sharing on aligned array example
I would like to see the effect of false sharing. To do so, I tried to design a small experiment but I got unexpected results.
I have an array containing 100 million integers. Consider it as an m x n matrix. One thread changes odd-indexed rows and the other thread changes even-indexed rows.
Experiment A: The number of columns is 16, so each row is 64 bytes, exactly my cache-line size. Since each thread processes exactly one cache line at a time, there should be no false sharing. Therefore, I expect around 100% speedup.
Experiment B: The number of columns is 8. Each thread changes 32 bytes at a time, which is half a cache line. For example, if thread 1 processes row 33, the data should be transferred from thread 0, because thread 0 has already processed row 32, which is in the same cache line (or vice versa; the order does not matter). Because of this communication, the speedup should be low.
#include <iostream>
#include <cstdlib>   // atoi, aligned_alloc, free
#include <cstdio>    // printf
#include <omp.h>

using namespace std;

int main(int argc, char** argv) {
    if(argc != 3) {
        cout << "Usage: " << argv[0] << " <iteration> <col_count>" << endl;
        return 1;
    }

    int thread_count = omp_get_max_threads();
    int iteration = atoi(argv[1]);
    int col_count = atoi(argv[2]);
    int arr_size = 100000000;

    // 64-byte alignment (16 * sizeof(int)) so every 16-int row starts on a
    // cache-line boundary. Note: aligned_alloc returns uninitialized memory,
    // so the first pass reads indeterminate values.
    int* A = (int*) aligned_alloc(16 * sizeof(int), arr_size * sizeof(int));

    int row_count = arr_size / col_count;
    int row_count_per_thread = row_count / thread_count;

    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();
        long long total = 1ll * iteration * row_count_per_thread * col_count;
        printf("%lld\n", total);

        for(int t = 0; t < iteration; t++) {
            for(int i = 0; i < row_count_per_thread; i++) {
                // thread 0 takes even rows, thread 1 takes odd rows
                int start = (i * thread_count + thread_id) * col_count;
                for(int j = start; j < start + col_count; j++) {
                    if(A[j] % 2 == 0)
                        A[j] += 3;
                    else
                        A[j] += 1;
                }
            }
        }
    }

    free(A);
    return 0;
}
I run this code with different configurations in the following way:
time taskset -c 0-1 ./run 100 16
Here are the results for 100 iterations:
Threads   Columns   Optimization   Time (secs)
-----------------------------------------------
   1        16          O3             7.6
   1         8          O3             7.7
   2        16          O3             7.7
   2         8          O3             7.7
   1        16          O0            35.9
   1         8          O0            34.3
   2        16          O0            19.3
   2         8          O0            18.2
As you can see, although O3 gives the best times, the results are strange: increasing the number of threads gives no speedup at all. The O0 results are easier for me to interpret.
The real question: look at the last two lines. In both cases I got almost 100% speedup, yet I expected the execution time of experiment B to be much higher because of its false-sharing issue. What is wrong with my experiment or my understanding?
I compiled it with
g++ -std=c++11 -Wall -fopenmp -O0 -o run -Iinc $(SOURCE)
and
g++ -std=c++11 -Wall -fopenmp -O3 -o run -Iinc $(SOURCE)
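(A side note on checking what -O3 actually did to the inner loop; this is an assumption about the build, not something verified here. GCC can often vectorize a loop like this, turning the even/odd branch into branchless SIMD updates and leaving the run memory-bound, and GCC 4.9+ can report it via -fopt-info-vec:

g++ -std=c++11 -Wall -fopenmp -O3 -fopt-info-vec -o run -Iinc $(SOURCE)

If the inner loop is reported as vectorized, each run is likely limited by memory bandwidth rather than computation.)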
Let me know if my problem is not clear or needs more detail.
Update: Specs:
MemTotal: 8080796 kB
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 71
Model name: Intel(R) Core(TM) i7-5700HQ CPU @ 2.70GHz
Stepping: 1
CPU MHz: 2622.241
CPU max MHz: 3500,0000
CPU min MHz: 800,0000
BogoMIPS: 5387.47
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
Update 2: I have tried different iteration_count and arr_size parameters so that the array fits in the L1/L2 caches, while keeping the total number of element updates constant. But the results are still the same.
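(For concreteness, one such configuration, assuming the 256K L2 listed above: arr_size = 65536 ints = 256 KB, with iteration raised to about 152,000 so that iteration × arr_size stays at the original ~10^10 element updates.)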
Thank you.
c++ multithreading caching memory openmp

asked Nov 13 '18 at 11:31 by Seljuk Gülcan, edited Nov 14 '18 at 12:57
Why are you testing with optimizations disabled? That means there is a lot of overhead that masks the false-sharing latencies...
– Max Langhof
Nov 13 '18 at 11:39

Please repeat with optimization - any performance discussion without optimization is meaningless. Combing through 800 MB of data like this should never take more than 0.1 s. Also please upgrade your code to a Minimal, Complete, and Verifiable example to help with a practical answer.
– Zulan
Nov 13 '18 at 12:04

@MaxLanghof Thank you for the response. I edited the question, but I got no speedup when I increased the number of threads with O3 optimization. Can you check the edited question, please? I added a simpler version of the code.
– Seljuk Gülcan
Nov 13 '18 at 15:18

Have you watched this video? It seems an exact code copy. And the answers are in the video.
– Ripi2
Nov 13 '18 at 18:01

@Ripi2 I checked the video after you mentioned it. Thank you, I think it is a very good resource and I learnt many things from it. Although the code is not the same, the concepts are similar. However, what I experience here is the opposite of what the video says should happen. I am asking why that is so.
– Seljuk Gülcan
Nov 14 '18 at 12:53
1 Answer
Your -O3 timing seems consistent with the single-channel memory bandwidth of your processor. You could probably get up to 2x better speed with a dual-channel memory configuration, but that is unlikely to change your results otherwise. Bear in mind that your processor has a single L3 cache shared between the cores, so any false sharing will most likely be resolved at the L3 level and won't put additional load on the external memory bus.
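As a rough back-of-envelope check (my own numbers, assuming DDR3-1600 at ~12.8 GB/s per channel): the array is 100M × 4 B = 400 MB, and each iteration both reads and writes every element, so roughly 0.8 GB of traffic per iteration and ~80 GB for 100 iterations. At single-channel speed that is at least ~6 s, which is in the same ballpark as the observed 7.6-7.7 s regardless of the thread count.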
There are a lot more problems (than just "slow" memory access) with your code that may prevent you from seeing the effects of false-sharing.
First, it's quite unlikely that both your threads will compete for exactly the same cache line, given the timing unpredictability involved in thread scheduling.
Second, even if they do have a conflict, it will be temporary, because any factor that slows one thread down asymmetrically will delay its scanning until it is past the conflicting memory range.
Third, if they happen to run on two hardware threads of the same core, they will access the same physical caches, and there will be no cache-line conflicts at all.
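(On Linux you can check how taskset -c 0-1 maps onto cores, e.g. with lscpu -e or cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list; depending on the enumeration, logical CPUs 0 and 1 may be either two separate physical cores or two hardware threads of the same core.)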
To "fix" all of this, you need more threads (or threads bound to particular cores) and a much tighter memory area for possible conflicts. The "best" results will be if your threads compete for just one cache line of memory.
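A minimal sketch of that single-cache-line stress test (illustrative code, not from the original experiment; the 64-byte line size and all names here are assumptions). Every thread increments its own counter, first with all counters packed into one 64-byte line, then padded to one line per counter; volatile keeps -O3 from collapsing each loop into a single add:

#include <cstdio>
#include <omp.h>

constexpr int kLine = 64;                 // assumed cache-line size in bytes
constexpr long long kIters = 100000000LL; // enough work to dominate overhead

struct Padded { alignas(kLine) volatile int v; }; // one counter per line

int main() {
    alignas(kLine) volatile int packed[16] = {};  // 16 ints = one 64-byte line
    Padded padded[16] = {};

    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        int id = omp_get_thread_num() % 16;       // stay within the arrays
        for (long long i = 0; i < kIters; i++)
            packed[id] = packed[id] + 1;          // neighbours falsely share the line
    }
    double t1 = omp_get_wtime();
    #pragma omp parallel
    {
        int id = omp_get_thread_num() % 16;
        for (long long i = 0; i < kIters; i++)
            padded[id].v = padded[id].v + 1;      // each thread owns its own line
    }
    double t2 = omp_get_wtime();
    printf("packed: %.2fs  padded: %.2fs\n", t1 - t0, t2 - t1);
    return 0;
}

Compiled the same way (g++ -std=c++11 -fopenmp -O3), the packed version should run noticeably slower than the padded one once the threads land on different physical cores; on two hardware threads of the same core the gap should mostly disappear, which is the third point above.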
answered Nov 23 '18 at 11:44 by Kit, edited Nov 23 '18 at 13:55

Thanks. In all of my experiments, each thread is bound to a different physical core, so I think your third point is not an issue. I'll run experiments with different numbers of cores and varying memory areas on a different architecture where the number of cores is much higher. I'll update when I get results.
– Seljuk Gülcan
Nov 23 '18 at 12:14