How could I parallelize this function with OpenMP without race conditions or false sharing?

I need to parallelize one function without race conditions or false sharing. I have tried many ways but I have not achieved that yet. The function is:



__inline static
void calculateClusterCentroIDs(int numCoords, int numObjs, int numClusters, float *dataSetMatrix, int *clusterAssignmentCurrent, float *clustersCentroID) {
    int *clusterMemberCount = (int *) calloc(numClusters, sizeof(int));

    // sum all points
    // for every point
    for (int i = 0; i < numObjs; ++i) {
        // which cluster is it in?
        int activeCluster = clusterAssignmentCurrent[i];

        // update count of members in that cluster
        ++clusterMemberCount[activeCluster];

        // sum point coordinates for finding centroid
        for (int j = 0; j < numCoords; ++j)
            clustersCentroID[activeCluster*numCoords + j] += dataSetMatrix[i*numCoords + j];
    }

    // now divide each coordinate sum by number of members to find mean/centroid
    // for each cluster
    for (int i = 0; i < numClusters; ++i) {
        if (clusterMemberCount[i] != 0)
            // for each coordinate
            for (int j = 0; j < numCoords; ++j)
                clustersCentroID[i*numCoords + j] /= clusterMemberCount[i]; /// XXXX will divide by zero here for any empty clusters!
    }

    free(clusterMemberCount);
}


Any idea how I could achieve that?



Thank you.

  • What have you tried and how did it not work? What is a "career condition"?

    – Zulan
    Nov 15 '18 at 14:56

  • That means: some threads write to the same position at the same time. It can be solved with atomic or critical, or by changing the sequential code to avoid it.

    – JuMoGar
    Nov 15 '18 at 15:01

  • And I have tried many ways; I can list all of them. Basically I have tried changing the sequential code by adding one extra dimension to each variable and then reducing it. I mean: (seq:) int *myVar; (omp:) int **myVar --> (seq:) myVar[i]; (omp:) myVar[omp_get_thread_num()][i]

    – JuMoGar
    Nov 15 '18 at 15:03

  • So you mean race condition, or data race, sometimes also called a hazard. But I have never heard it referred to as a career condition. Have you tried the easiest option: adding omp parallel for to the first inner loop and to the second outer loop?

    – Zulan
    Nov 15 '18 at 15:19

  • Yes, I have tried that as well. But I did not get the expected result (it should be 300, but I get other numbers). Oh, ok, I will update the question to say 'race conditions', thank you.

    – JuMoGar
    Nov 15 '18 at 15:31

c performance parallel-processing openmp

3 Answers

This is pretty straightforward:



// sum all points
// for every point
for (int i = 0; i < numObjs; ++i) {
    // which cluster is it in?
    int activeCluster = clusterAssignmentCurrent[i];

    // update count of members in that cluster
    ++clusterMemberCount[activeCluster];

    // sum point coordinates for finding centroid
    #pragma omp parallel for
    for (int j = 0; j < numCoords; ++j)
        clustersCentroID[activeCluster*numCoords + j] += dataSetMatrix[i*numCoords + j];
}


The inner loop is perfectly fine to parallelize, as all writes happen to different elements of clustersCentroID. You can safely assume that the default schedule will not exhibit significant false sharing; it typically uses large enough chunks. Just don't try something like schedule(static,1).



The outer loop is not as easy to parallelize. You could either use a reduction over clusterMemberCount and clustersCentroID (a sketch follows below), or do something like:



#pragma omp parallel // note NO for
for (int i = 0; i < numObjs; ++i) {
    int activeCluster = clusterAssignmentCurrent[i];
    // ensure that exactly one thread works on each cluster
    if (activeCluster % omp_get_num_threads() != omp_get_thread_num()) continue;


Only do this if the simple solution does not yield sufficient performance.
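As an illustration of the reduction route mentioned above (this sketch is an editorial addition, not part of the original answer), assuming OpenMP 4.5 or newer for array-section reductions and that a private copy of both arrays per thread is affordable, it could look roughly like this:

// Minimal sketch (assumes OpenMP >= 4.5): each thread accumulates into private
// copies of both arrays, which OpenMP combines at the end, so there are no races
// and no concurrently written cache lines.
#pragma omp parallel for \
        reduction(+ : clusterMemberCount[:numClusters]) \
        reduction(+ : clustersCentroID[:numClusters * numCoords])
for (int i = 0; i < numObjs; ++i) {
    int activeCluster = clusterAssignmentCurrent[i];
    ++clusterMemberCount[activeCluster];
    for (int j = 0; j < numCoords; ++j)
        clustersCentroID[activeCluster*numCoords + j] += dataSetMatrix[i*numCoords + j];
}

The trade-off is the extra memory and the combine step, both proportional to numClusters * numCoords per thread.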



The other loop is simple as well



#pragma omp parallel for
for (int i = 0; i < numClusters; ++i) {
    if (clusterMemberCount[i] != 0)
        // for each coordinate
        for (int j = 0; j < numCoords; ++j)
            clustersCentroID[i*numCoords + j] /= clusterMemberCount[i];
}


Again, data access is perfectly isolated both in terms of correctness and, except for edge cases, false sharing.

  • Parallelizing the first outer loop with if (activeCluster % omp_get_num_threads() != omp_get_thread_num()) continue; will distribute consecutive values of activeCluster to different threads. This will result in some false sharing when writing to clusterMemberCount, and potentially also when writing to clustersCentroID if there are not that many dimensions (the edge effect becomes significant if numCoords is small, i.e. exactly the case in which parallelizing the inner loop is not efficient).

    – Brice
    Nov 19 '18 at 9:51

  • You are right, but I did not want to make assumptions about numCoords. The simple approach is fine until you know more about the values and/or measurements show that it is too slow. I was only trying to sketch a direction for parallelizing the outer loop - a reduction is usually advisable. You can use any sophisticated distribution of clusters to threads.

    – Zulan
    Nov 19 '18 at 12:31

You should give an order of magnitude for the expected values of numCoords, numObjs and numClusters, as the optimal way to parallelize depends on that. In particular, numCoords matters for deciding whether parallelizing/vectorizing the inner loop over the coordinates makes sense; are you talking about 3D coordinates or 1000 dimensions?



Here is another attempt, with the drawbacks of an if statement in the first loop (bad for performance) and a static schedule (possible load imbalance), but with each thread incrementing contiguous parts of clusterMemberCount and clustersCentroID, which limits the risk of false sharing.



#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_num_threads() 1
#define omp_get_thread_num() 0
#endif

__inline static
void calculateClusterCentroIDs(int numCoords, int numObjs, int numClusters, float *dataSetMatrix, int *clusterAssignmentCurrent, float *clustersCentroID) {
    int *clusterMemberCount = (int *) calloc(numClusters, sizeof(int));
    // sum all points
    // for every point

    #pragma omp parallel
    {
        int nbOfThreads = omp_get_num_threads();
        int thisThread = omp_get_thread_num();
        // Schedule for the first step: process only clusters with an ID in the [from, to[ range
        int clustFrom = (thisThread*numClusters)/nbOfThreads;
        int clustTo = (thisThread+1 == nbOfThreads) ? numClusters : ((thisThread+1)*numClusters)/nbOfThreads;

        // Each thread loops through all values of numObjs but only processes them depending on activeCluster
        // The loop is skipped only if the thread was assigned no cluster
        if (clustTo > clustFrom) {
            for (int i = 0; i < numObjs; ++i) {
                // which cluster is it in?
                int activeCluster = clusterAssignmentCurrent[i];

                if (activeCluster >= clustFrom && activeCluster < clustTo) {
                    // update count of members in that cluster
                    ++clusterMemberCount[activeCluster];

                    // sum point coordinates for finding centroid
                    for (int j = 0; j < numCoords; ++j)
                        clustersCentroID[activeCluster*numCoords + j] += dataSetMatrix[i*numCoords + j];
                }
            }
        }

        #pragma omp barrier

        // now divide each coordinate sum by number of members to find mean/centroid
        // for each cluster
        #pragma omp for // straightforward
        for (int i = 0; i < numClusters; ++i) {
            if (clusterMemberCount[i] != 0)
                // for each coordinate
                for (int j = 0; j < numCoords; ++j)
                    clustersCentroID[i*numCoords + j] /= clusterMemberCount[i]; /// XXXX will divide by zero here for any empty clusters!
        }
    }
    free(clusterMemberCount);
}

  • This is a good approach. The code snippet is mixing clustFrom/To and clusterLimits. You might also want to consider the case numClusters < nbOfThreads.

    – Zulan
    Nov 19 '18 at 12:35

  • You are right about the cluster limits & from/to, I edited the answer. The code should run fine even if numClusters < nbOfThreads. Of course, in that case, all threads but the last one will have clustFrom==0 and clustTo==0 and would loop for no reason...

    – Brice
    Nov 19 '18 at 13:59

  • I moved the parentheses a bit in the clustFrom & clustTo definitions for better load balancing, and added an if to skip the first loop if there is nothing to process, rather than looping and hitting an if(condition_always_false).

    – Brice
    Nov 19 '18 at 14:25

Adding to my comment: ++clusterMemberCount[activeCluster] forms a histogram and is problematic when two threads try to update the same item (or adjacent items sharing a cache line). This needs to be taken out of the loop, either as a sequential part, or it must be parallelized by giving each thread a separate copy of the histogram and combining them afterwards.



You can easily separate this part from the first parallel loop.



// Make the histogram
for (int i = 0; i < numObjs; ++i) {
    int activeCluster = clusterAssignmentCurrent[i];
    ++clusterMemberCount[activeCluster];
}
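A minimal sketch of the per-thread-copy variant mentioned above (an editorial illustration, not code from the answer; it assumes the thread count reported by omp_get_max_threads() is an upper bound on the threads actually used, and it needs <omp.h> and <stdlib.h>):

// Per-thread histograms, combined after the parallel region.
int nThreads = omp_get_max_threads();
int *privateCounts = (int *) calloc((size_t)nThreads * numClusters, sizeof(int));

#pragma omp parallel
{
    int *myCount = privateCounts + (size_t)omp_get_thread_num() * numClusters; // this thread's copy

    #pragma omp for
    for (int i = 0; i < numObjs; ++i)
        ++myCount[clusterAssignmentCurrent[i]];
}

// combine the per-thread histograms into the shared one
for (int t = 0; t < nThreads; ++t)
    for (int c = 0; c < numClusters; ++c)
        clusterMemberCount[c] += privateCounts[(size_t)t * numClusters + c];

free(privateCounts);

If numClusters is small, padding each per-thread slice up to a cache line would also avoid false sharing between the copies themselves.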


Then process everything exploiting parallelism



// parallel processing
#pragma omp parallel for
for (int i = 0; i < numObjs; ++i) {
    int activeCluster = clusterAssignmentCurrent[i];
    for (int j = 0; j < numCoords; ++j)
        clustersCentroID[activeCluster*numCoords + j] += dataSetMatrix[i*numCoords + j];
}


The second place where there is potential false sharing is when numCoords * sizeof(clustersCentroID[0]) % 64 != 0 (assuming a 64-byte cache line). This can be mitigated by over-allocating clustersCentroID so that each cluster's row takes a full multiple of 64 bytes.



// Loop for numCoords, but index by numCoordsX
for (int j = 0; j < numCoords; ++j)
    clustersCentroID[activeCluster*numCoordsX + j] += dataSetMatrix[i*numCoords + j];
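For illustration, the over-allocation could be set up as follows; this is an assumed realization of the suggestion rather than code from the answer (the name numCoordsX comes from the snippet above), and it needs <stdlib.h> for C11 aligned_alloc and <string.h> for memset:

// Round the per-cluster stride up to a whole number of 64-byte cache lines
// (16 floats per line), so rows of different clusters never share a line.
size_t lineFloats = 64 / sizeof(float); // 16
size_t numCoordsX = ((size_t)numCoords + lineFloats - 1) / lineFloats * lineFloats;
// the total size is a multiple of 64, as aligned_alloc requires
float *clustersCentroID = aligned_alloc(64, (size_t)numClusters * numCoordsX * sizeof(float));
memset(clustersCentroID, 0, (size_t)numClusters * numCoordsX * sizeof(float));

The second loop (the division by clusterMemberCount[i]) then has to index with numCoordsX as well.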





share|improve this answer
























  • Then, assuming a 64-byte cache line, numCoordsX = numCoords*64?

    – JuMoGar
    Nov 18 '18 at 17:27

  • float *clustersCentroID means that each element takes 4 bytes, thus numCoordsX must be a multiple of 16. That doesn't mean it has to be any particular multiple of numCoords. But again, just as someone else commented earlier, you should tell us the expected magnitude of numCoords and all the relevant variables, e.g. what is the maximum clusterMemberCount, what is the maximum activeCluster, and so on.

    – Aki Suihkonen
    Nov 18 '18 at 18:17

  • There is a race condition in this code when writing to clustersCentroID. It should be protected somehow, the same way the increment of clusterMemberCount has been protected against race conditions. As long as two threads can be processing points belonging to the same activeCluster, writing to an array indexed by activeCluster is not safe.

    – Brice
    Nov 19 '18 at 9:57
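For completeness, one way to protect those updates (an editorial sketch, not taken from the answer) is to make each accumulation atomic; this removes the race at the cost of potentially heavy serialization when many points fall into the same cluster:

// Racy accumulation made safe with per-element atomic updates.
#pragma omp parallel for
for (int i = 0; i < numObjs; ++i) {
    int activeCluster = clusterAssignmentCurrent[i];
    for (int j = 0; j < numCoords; ++j) {
        #pragma omp atomic
        clustersCentroID[activeCluster*numCoords + j] += dataSetMatrix[i*numCoords + j];
    }
}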
