How could I parallelize this function with OpenMP without race conditions or false sharing?
I need to parallelize one function without race conditions or false sharing. I have tried many ways but have not achieved it yet. The function is:
__inline static
void calculateClusterCentroIDs(int numCoords, int numObjs, int numClusters, float * dataSetMatrix, int * clusterAssignmentCurrent, float *clustersCentroID) {
    int * clusterMemberCount = (int *) calloc (numClusters,sizeof(float));

    // sum all points
    // for every point
    for (int i = 0; i < numObjs; ++i) {
        // which cluster is it in?
        int activeCluster = clusterAssignmentCurrent[i];
        // update count of members in that cluster
        ++clusterMemberCount[activeCluster];
        // sum point coordinates for finding centroid
        for (int j = 0; j < numCoords; ++j)
            clustersCentroID[activeCluster*numCoords + j] += dataSetMatrix[i*numCoords + j];
    }

    // now divide each coordinate sum by number of members to find mean/centroid
    // for each cluster
    for (int i = 0; i < numClusters; ++i) {
        if (clusterMemberCount[i] != 0)
            // for each coordinate
            for (int j = 0; j < numCoords; ++j)
                clustersCentroID[i*numCoords + j] /= clusterMemberCount[i]; /// XXXX will divide by zero here for any empty clusters!
    }

    free(clusterMemberCount);
}
Any idea how I could achieve that?
Thank you.
c performance parallel-processing openmp
What have you tried and how did it not work? What is a "career condition"?
– Zulan
Nov 15 '18 at 14:56
That means: some threads write to the same position at the same time. It can be solved with atomic, critical, or by changing the sequential code to avoid it.
– JuMoGar
Nov 15 '18 at 15:01
And I have tried many ways; I can list all of them. Basically, I have tried changing the sequential code by adding one extra dimension to each variable and then reducing it. I mean: (seq:) int *myVar; (omp:) int **myVar --> (seq:) myVar[i]; (omp:) myVar[omp_get_thread_num()][i]
– JuMoGar
Nov 15 '18 at 15:03
So you mean race condition, or data race, sometimes also called a hazard. But I have never heard it referred to as a career condition. Have you tried the easiest thing: adding omp parallel for to the first inner loop and to the second outer loop?
– Zulan
Nov 15 '18 at 15:19
Yes, I have also tried that. But I did not get the expected result (it should be 300, but I get other numbers). Oh, ok, I will update the question with 'race conditions', thank you.
– JuMoGar
Nov 15 '18 at 15:31
3 Answers
This is pretty straightforward:
// sum all points
// for every point
for (int i = 0; i < numObjs; ++i) {
    // which cluster is it in?
    int activeCluster = clusterAssignmentCurrent[i];
    // update count of members in that cluster
    ++clusterMemberCount[activeCluster];
    // sum point coordinates for finding centroid
    #pragma omp parallel for
    for (int j = 0; j < numCoords; ++j)
        clustersCentroID[activeCluster*numCoords + j] += dataSetMatrix[i*numCoords + j];
}
The inner loop is perfectly fine to parallelize, as all writes happen to different elements of clustersCentroID. You can safely assume that the default schedule will not exhibit significant false sharing; it typically has large-enough chunks. Just don't try something like schedule(static,1).
The outer loop is not as easy to parallelize. You could either use a reduction on clusterMemberCount and clustersCentroID, or do something like:
#pragma omp parallel // note NO for
for (int i = 0; i < numObjs; ++i) {
    int activeCluster = clusterAssignmentCurrent[i];
    // ensure that exactly one thread works on each cluster
    if (activeCluster % omp_get_num_threads() != omp_get_thread_num()) continue;
Only do this if the simple solution does not yield sufficient performance.
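For reference, a minimal sketch of the reduction variant mentioned above, assuming a compiler with OpenMP 4.5 support (array-section reductions over pointers are not available in older OpenMP versions):

// Sketch only: each thread accumulates into private, zero-initialized copies
// of both arrays; OpenMP sums the copies back at the end of the loop.
#pragma omp parallel for \
        reduction(+: clusterMemberCount[:numClusters]) \
        reduction(+: clustersCentroID[:numClusters*numCoords])
for (int i = 0; i < numObjs; ++i) {
    int activeCluster = clusterAssignmentCurrent[i];
    ++clusterMemberCount[activeCluster];
    for (int j = 0; j < numCoords; ++j)
        clustersCentroID[activeCluster*numCoords + j] += dataSetMatrix[i*numCoords + j];
}

This avoids both the data race and false sharing during accumulation; the price is extra memory roughly proportional to the number of threads times numClusters*numCoords.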
The other loop is simple as well
#pragma omp parallel for
for (int i = 0; i < numClusters; ++i) {
    if (clusterMemberCount[i] != 0)
        // for each coordinate
        for (int j = 0; j < numCoords; ++j)
            clustersCentroID[i*numCoords + j] /= clusterMemberCount[i];
}
Again, data access is perfectly isolated both in terms of correctness and, except for edge cases, false sharing.
Parallelizing the first outer loop with if (activeCluster % omp_get_num_threads() != omp_get_thread_num()) continue; will distribute every other value of activeCluster to a different thread. This will result in some false sharing when writing to clusterMemberCount, and potentially also when writing to clustersCentroID if there are not that many dimensions (the edge effect becomes significant if numCoords is small, i.e. exactly the case where parallelizing the inner loop is not efficient).
– Brice
Nov 19 '18 at 9:51
You are right, but I did not want to make assumptions about numCoords. The simple approach is fine until you know more about the values and/or measurements show that it is too slow. I only tried to sketch the direction for parallelizing the outer loop - a reduction is usually advisable. You can use any sophisticated distribution of clusters to threads.
– Zulan
Nov 19 '18 at 12:31
You should give an order of magnitude for the expected values of numCoords, numObjs and numClusters, as the optimal way to parallelize depends on that. In particular, numCoords matters for deciding whether parallelizing/vectorizing the inner loop over coordinates makes sense: are you talking about 3D coordinates or 1000 dimensions?
Here is another attempt, with the drawback of an if statement in the first loop (bad for performance) and a static schedule (possible load imbalance), but with each thread incrementing contiguous parts of clusterMemberCount and clustersCentroID, which limits the risk of false sharing.
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_num_threads() 1
#define omp_get_thread_num() 0
#endif

__inline static
void calculateClusterCentroIDs(int numCoords, int numObjs, int numClusters, float * dataSetMatrix, int * clusterAssignmentCurrent, float *clustersCentroID) {
    int * clusterMemberCount = (int *) calloc (numClusters,sizeof(float));

    // sum all points
    // for every point
    #pragma omp parallel
    {
        int nbOfThreads = omp_get_num_threads();
        int thisThread = omp_get_thread_num();

        // Schedule for the first step: process only clusters with ID in the [from, to[ range
        int clustFrom = (thisThread*numClusters)/nbOfThreads;
        int clustTo = (thisThread+1 == nbOfThreads) ? numClusters : ((thisThread+1)*numClusters)/nbOfThreads;

        // Each thread will loop through all values of numObjs but only process them depending on activeCluster
        // The loop is skipped only if the thread was assigned no cluster
        if (clustTo>clustFrom){
            for (int i = 0; i < numObjs; ++i) {
                // which cluster is it in?
                int activeCluster = clusterAssignmentCurrent[i];
                if (activeCluster>=clustFrom && activeCluster<clustTo){
                    // update count of members in that cluster
                    ++clusterMemberCount[activeCluster];
                    // sum point coordinates for finding centroid
                    for (int j = 0; j < numCoords; ++j)
                        clustersCentroID[activeCluster*numCoords + j] += dataSetMatrix[i*numCoords + j];
                }
            }
        }

        #pragma omp barrier

        // now divide each coordinate sum by number of members to find mean/centroid
        // for each cluster
        #pragma omp for // straightforward
        for (int i = 0; i < numClusters; ++i) {
            if (clusterMemberCount[i] != 0)
                // for each coordinate
                for (int j = 0; j < numCoords; ++j)
                    clustersCentroID[i*numCoords + j] /= clusterMemberCount[i]; /// XXXX will divide by zero here for any empty clusters!
        }
    }
    free(clusterMemberCount);
}
This is a good approach. The code snippet is mixing clustFrom/To and clusterLimits. You might want to consider numClusters < nbOfThreads.
– Zulan
Nov 19 '18 at 12:35
You are right about the cluster limits & from/to, I edited the answer. The code should run fine even if numClusters < nbOfThreads. Of course, in that case, all threads but the last one will have clustFrom==0 and clustTo==0 and would loop for no reason...
– Brice
Nov 19 '18 at 13:59
I moved the parentheses in the clustFrom & clustTo definitions a bit for better load balancing, and added an if to skip the first loop if there is nothing to process, rather than loop and hit an if(condition_always_false).
– Brice
Nov 19 '18 at 14:25
Adding to my comment: ++clusterMemberCount[activeCluster] forms a histogram and is problematic when two threads try to update the same item (or adjacent items sharing a cache line). This needs to be taken out of the loop, either as a sequential part, or it must be parallelized by giving each thread a separate copy of the histogram and then combining them.
You can easily separate this part from the first parallel loop.
// Make the histogram
for (int i = 0; i < numObjs; ++i) {
    int activeCluster = clusterAssignmentCurrent[i];
    ++clusterMemberCount[activeCluster];
}
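If the histogram step itself becomes a bottleneck, a rough sketch of the per-thread-copy idea mentioned above could look like this (localCount is a hypothetical scratch buffer, not part of the original code):

#pragma omp parallel
{
    // each thread fills its own private histogram
    int *localCount = (int *) calloc(numClusters, sizeof(int));

    #pragma omp for nowait
    for (int i = 0; i < numObjs; ++i)
        ++localCount[clusterAssignmentCurrent[i]];

    // combine the private histograms into the shared one
    #pragma omp critical
    for (int c = 0; c < numClusters; ++c)
        clusterMemberCount[c] += localCount[c];

    free(localCount);
}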
Then process everything exploiting parallelism
// parallel processing
#pragma omp parallel for
for (int i = 0; i < numObjs; ++i) {
    int activeCluster = clusterAssignmentCurrent[i];
    for (int j = 0; j < numCoords; ++j)
        clustersCentroID[activeCluster*numCoords + j] += dataSetMatrix[i*numCoords + j];
}
A second place where potential false sharing occurs is when numCoords * sizeof(clustersCentroID[0]) % 64 != 0, assuming a 64-byte cache line. This can be mitigated by overallocating clustersCentroID to full multiples of 64 bytes.
// Loop for numCoords, but index by numCoordsX
for (int j = 0; j < numCoords; ++j)
    clustersCentroID[activeCluster*numCoordsX + j] += dataSetMatrix[i*numCoords + j];
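As an illustration (not from the original answer), the padded stride could be computed at the allocation site roughly like this; numCoordsX matches the indexing above, the other names are made up for the example:

// round the per-cluster stride up to a whole number of 64-byte cache lines
// (with 4-byte floats that is a multiple of 16 elements)
size_t floatsPerLine = 64 / sizeof(float);
size_t numCoordsX = ((numCoords + floatsPerLine - 1) / floatsPerLine) * floatsPerLine;

// overallocate so that each cluster's row occupies its own set of cache lines;
// calloc also zero-initializes, which the accumulation needs
float *clustersCentroID = calloc((size_t)numClusters * numCoordsX, sizeof(float));

Note that for each row to actually start on a cache-line boundary, the base pointer itself should be 64-byte aligned, which plain calloc does not guarantee.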
Then, assuming a 64-byte cache line, numCoordsX = numCoords*64?
– JuMoGar
Nov 18 '18 at 17:27
float *clustersCentroID means that each element takes 4 bytes, thus numCoordsX must be a multiple of 16. Which doesn't mean it has to be any multiple of numCoords. But again, just as commented earlier by someone else, you should tell us the expected magnitude of numCoords and all the relevant variables, e.g. what is the maximum clusterMemberCount, what is the maximum activeCluster, and so on.
– Aki Suihkonen
Nov 18 '18 at 18:17
There is a race condition in this code when writing to clustersCentroID. It should be protected somehow, the same way the increment of clusterMemberCount has been protected against race conditions. As long as two threads can be processing points belonging to the same activeCluster, writing to an array indexed by activeCluster is not safe.
– Brice
Nov 19 '18 at 9:57