How can I use OpenMP and AVX2 simultaneously and still get the correct answer?

I wrote a matrix-vector product program using OpenMP and AVX2.
However, I get the wrong answer because of OpenMP.
The correct result is that every element of the array c becomes 100.
My result is a mix of 98, 99, and 100.
The actual code is below.
I compiled with Clang using -fopenmp, -mavx, and -mfma.
#include "stdio.h"
#include "math.h"
#include "stdlib.h"
#include "omp.h"
#include "x86intrin.h"
void mv(double *a,double *b,double *c, int m, int n, int l)
{
int k;
#pragma omp parallel
{
__m256d va,vb,vc;
int i;
#pragma omp for private(i, va, vb, vc) schedule(static)
for (k = 0; k < l; k++) {
vb = _mm256_broadcast_sd(&b[k]);
for (i = 0; i < m; i+=4) {
va = _mm256_loadu_pd(&a[m*k+i]);
vc = _mm256_loadu_pd(&c[i]);
vc = _mm256_fmadd_pd(vc, va, vb);
_mm256_storeu_pd( &c[i], vc );
}
}
}
}
int main(int argc, char* argv) {
// set variables
int m;
double* a;
double* b;
double* c;
int i;
m=100;
// main program
// set vector or matrix
a=(double *)malloc(sizeof(double) * m*m);
b=(double *)malloc(sizeof(double) * m*1);
c=(double *)malloc(sizeof(double) * m*1);
//preset
for (i=0;i<m;i++) {
a[i]=1;
b[i]=1;
c[i]=0.0;
}
for (i=m;i<m*m;i++) {
a[i]=1;
}
mv(a, b, c, m, 1, m);
for (i=0;i<m;i++) {
printf("%en", c[i]);
}
free(a);
free(b);
free(c);
return 0;
}
I know a critical section would help, but the critical section was slow.
So, how can I solve the problem?
Not that it matters, but why did you leave out -mavx2? Anyway, you should use -march=native to enable everything your CPU can use, and more importantly to tune for that CPU instead of still tuning for the generic baseline CPU. – Peter Cordes, Jul 1 at 3:40
Did you try declaring your variables inside the innermost scope, so each loop iteration has its own __m256d va = _mm256_loadu_pd(&a[m*k+i]);, and its own i with for (int i = 0; ...)? IDK OpenMP well, so IDK if that will make a difference, but it seems totally unnecessary to declare any of those variables separately. – Peter Cordes, Jul 1 at 3:44
I guess you are using column-major order (en.wikipedia.org/wiki/Row-major_order)? That's normal in Fortran but not C. – Z boson, Jul 2 at 9:25
2 Answers
The issue is not with your AVX intrinsics; let's look at the code without the intrinsics for a minute:
void mv(double *a, double *b, double *c, int m, int n, int l)
{
    #pragma omp parallel for schedule(static)
    for (int k = 0; k < l; k++) {
        double xb = b[k];
        for (int i = 0; i < m; i++) {
            double xa = a[m*k+i];
            double xc = c[i];
            xc = xc + xa * xb;
            c[i] = xc;
        }
    }
}
Note: your private declaration was technically correct but redundant, because those variables are already declared inside the parallel region. In any case, it is just so much easier to reason about the code if you declare variables as locally as possible.
The race condition in your code is on c[i], which multiple threads try to update. Even if you could protect that with, say, an atomic update, the performance would be horrible: not only because of the protection itself, but because the data of c[i] would constantly be shifted around between the caches of different cores.
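For illustration, a correctness-only sketch with an atomic update might look like this (the mv_atomic name is just for the example); it gives the right result but pays for the atomic on every single update:

void mv_atomic(double *a, double *b, double *c, int m, int n, int l)
{
    #pragma omp parallel for schedule(static)
    for (int k = 0; k < l; k++) {
        for (int i = 0; i < m; i++) {
            // every update of c[i] is protected, so the result is correct,
            // but each iteration now pays for the atomic operation
            #pragma omp atomic
            c[i] += a[m*k+i] * b[k];
        }
    }
}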
One thing you can do about this is to use an array reduction on c. This makes a private copy of c for each thread, and the copies get merged at the end:
void mv(double *a, double *b, double *c, int m, int n, int l)
{
    #pragma omp parallel for schedule(static) reduction(+:c[:m])
    for (int k = 0; k < l; k++) {
        for (int i = 0; i < m; i++) {
            c[i] += a[m*k+i] * b[k];
        }
    }
}
This should be reasonably efficient as long as two m-vectors fit in your cache, but you may still see significant thread-management overhead. Eventually you will be limited by memory bandwidth, because in a vector-matrix multiplication you only have one computation per element read from a.
Anyway, you can of course swap the i and k loops and save the reduction, but then your memory access pattern on a will be inefficient (strided), so you should block the loop to avoid that. A sketch of such a blocked variant is shown below.
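A minimal sketch of what that swapped, blocked variant might look like (the mv_blocked name and the block size BS are placeholders; BS needs tuning for your cache, and the column-major indexing from the question is assumed):

void mv_blocked(double *a, double *b, double *c, int m, int n, int l)
{
    const int BS = 256;  // block size over i; placeholder value, tune for your cache
    #pragma omp parallel for schedule(static)
    for (int ii = 0; ii < m; ii += BS) {
        int iend = ii + BS < m ? ii + BS : m;
        // each thread owns a disjoint range of i, so there is no race on c[i]
        for (int k = 0; k < l; k++) {
            double xb = b[k];
            for (int i = ii; i < iend; i++) {
                // consecutive i gives contiguous loads from a within the block
                c[i] += a[m*k+i] * xb;
            }
        }
    }
}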
Now if you look at the output of any modern compiler, you will see that it generates SIMD code on its own. Of course you can apply your own SIMD intrinsics if you want to, but make sure that you handle the edge cases correctly when m is not divisible by 4 (you did not in your original version).
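If you do want to keep the intrinsics, a sketch of the reduction version with a scalar tail for the m % 4 remainder might look like this (the mv_avx name is just for the example; the column-major layout from the question and a compiler with OpenMP 4.5 array-section reductions are assumed):

#include <immintrin.h>

void mv_avx(double *a, double *b, double *c, int m, int n, int l)
{
    #pragma omp parallel for schedule(static) reduction(+:c[:m])
    for (int k = 0; k < l; k++) {
        __m256d vb = _mm256_broadcast_sd(&b[k]);
        int i = 0;
        for (; i <= m - 4; i += 4) {
            // main SIMD loop: c[i..i+3] += a[i..i+3, k] * b[k]
            __m256d va = _mm256_loadu_pd(&a[m*k+i]);
            __m256d vc = _mm256_loadu_pd(&c[i]);
            vc = _mm256_fmadd_pd(va, vb, vc);  // va*vb + vc, i.e. a*b + c
            _mm256_storeu_pd(&c[i], vc);
        }
        for (; i < m; i++) {
            // scalar tail handles the remaining m % 4 elements
            c[i] += a[m*k+i] * b[k];
        }
    }
}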
At the end of the day, if you really want performance, use the functions from a BLAS library (e.g. MKL). If you want to play around with optimization, there are ample opportunities to go into the deep details.
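For reference, the equivalent call through a CBLAS interface (e.g. OpenBLAS or MKL) would look roughly like this, assuming the column-major m x l layout used in the question:

#include <cblas.h>

// c = 1.0 * A * b + 0.0 * c for a column-major m x l matrix A with leading dimension m
cblas_dgemv(CblasColMajor, CblasNoTrans, m, l,
            1.0, a, m,    // alpha, A, lda
            b, 1,         // x and its stride
            0.0, c, 1);   // beta, y and its stride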
Is the OP's math correct? Shouldn't it be c[i] += a[m*i+k]*b[k]? – Z boson, Jul 2 at 9:01

I mean like this: godbolt.org/g/cPNm2j – Z boson, Jul 2 at 9:05

Good catch, I haven't checked the indexing. Could be a column-major matrix, I guess. – Zulan, Jul 2 at 9:10

The OP also does vc = vc*va + vb. That does not make sense either. I think the OP wants something like vc = va*vb + vc. – Z boson, Jul 2 at 9:13

Okay, I think you are right. Column-major storage could explain it. – Z boson, Jul 2 at 9:18
The fundamental operation you want, accumulated over k, is

c[i] += a[i,k]*b[k]

If you use row-major order storage this becomes

c[i] += a[i*l + k]*b[k]

If you use column-major order storage this becomes

c[i] += a[k*m + i]*b[k]
For row-major order you can parallelize like this:

#pragma omp parallel for
for(int i=0; i<m; i++) {
    for(int k=0; k<l; k++) {
        c[i] += a[i*l+k]*b[k];
    }
}
For column-major order you can parallelize like this:

#pragma omp parallel
for(int k=0; k<l; k++) {
    #pragma omp for
    for(int i=0; i<m; i++) {
        c[i] += a[k*m+i]*b[k];
    }
}
Matrix-vector operations are BLAS Level 2 operations, which are memory-bandwidth bound. Level 1 and Level 2 operations don't scale with, e.g., the number of cores. Only the Level 3 operations (e.g. dense matrix-matrix multiplication) scale: https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms#Level_3