20210626, 19:46  #1 
(loop (#_fork))
Feb 2006
Cambridge, England
2×3×29×37 Posts 
Profiledriven optimisation for lanczos
Has anyone got thoughts about the NxB times BxB multiply once VBITS grows big?
My profile at the moment looks like Code:
26.39% msieveMPV256 msieveMPV256BDW [.] mul_BxN_NxB 26.09% msieveMPV256 msieveMPV256BDW [.] mul_trans_packed_core 21.01% msieveMPV256 msieveMPV256BDW [.] mul_packed_core 17.99% msieveMPV256 msieveMPV256BDW [.] core_NxB_BxB_acc 5.58% msieveMPV256 msieveMPV256BDW [.] mul_packed_small_core Code:
37.72% msieveMPV128 msieveMPV128HSW [.] mul_trans_packed_core 30.40% msieveMPV128 msieveMPV128HSW [.] mul_packed_core 14.74% msieveMPV128 msieveMPV128HSW [.] mul_BxN_NxB 7.40% msieveMPV128 msieveMPV128HSW [.] core_NxB_BxB_acc 6.50% msieveMPV128 msieveMPV128HSW [.] mul_packed_small_core The matrix itself is VBITS^2 bits, which is 8kB and fits nicely in L1 for VBITS=256, but is already getting inconvenient for VBITS=512; I wonder how much slower it is to do eight times as many accesses, in a perfectly uniform pattern so no address computation, to a table of 1/32 the size? I suppose I should also try VBITS/2 tables of 4*VBITS and VBITS/4 tables of 16*VBITS, which are 16k and 64k. Last fiddled with by fivemack on 20210627 at 19:16 
20210626, 22:23  #2 
(loop (#_fork))
Feb 2006
Cambridge, England
2×3×29×37 Posts 
This is driving me slightly mad: I made the obvious tweaks to use a 2bit rather than 8bit table, it slowed down immensely, and this was because it wasn't using YMM operations at all, rather carrying around the vector in four 64bit registers.
When I recoded to use absolutely explicit YMM operations: Code:
zero=_mm256_xor_si256(zero,zero); _mm256_zeroupper(); for (i = 0; i < n; i++) { accum = zero; jig = 0; for (j=0; j<VWORDS; j++) { uint64 g = v[i].w[j]; for (k=0; k<64; k+=2) { accum = _mm256_xor_si256(accum,*(__m256i*)(&(c[jig+(g&3)]))); jig+=4; g>>=2; } } __m256i* yy = (__m256i*)(&(y[i])); *yy = _mm256_xor_si256(*yy,accum); } VPXOR xmm0,xmm0,xmm0 (note the "x") from the intrinsic Code:
accum = _mm256_xor_si256(accum,accum); Last fiddled with by fivemack on 20210626 at 22:23 
20210627, 09:35  #3 
(loop (#_fork))
Feb 2006
Cambridge, England
14446_{8} Posts 
I am just downloading gcc11.1 in the hope that it will be equipped with a sufficiently heavyduty yak razor

20210627, 13:05  #4 
(loop (#_fork))
Feb 2006
Cambridge, England
2×3×29×37 Posts 
Well, that's not actually the cause; the vpxor xmm0,xmm0,xmm0 instruction is in fact defined as clearing the top half of the register (whilst the superficially similar pxor xmm0,xmm0 is defined as leaving the top half intact and causing falsesharing issues).
Last fiddled with by fivemack on 20210627 at 13:06 
20210702, 17:05  #5 
Tribal Bullet
Oct 2004
6727_{8} Posts 
If the working set is too large when biting off 8bit chunks then another option is to use a larger number of 6bit chunks. With word size W bits and chunk size C bits the table will have (W/8) * (W/C) * 2^C bytes.
The msievelacuda branch uses C=6 and it does make GPU runs faster, but the stock version uses W=64 only. Greg is spending time generalizing to larger word size and folding in MPI so that multiple GPUs can combine together. We are also going very carefully through the BxN * NxB vectorvector operation, which is a playground of fun to implement in CUDA. Last fiddled with by jasonp on 20210702 at 17:07 
20210702, 19:29  #6 
Jul 2003
So Cal
2^{2}×547 Posts 
I reimplemented the NxB_BxB CUDA kernel to bite off 2 bits at a time and make 4 (or actually 3 since the 00 table isn't needed) arrays in GPU shared memory rather than doing this on the cpu and uploading the result to the gpu. This gave a 35x speedup depending on the GPU.
https://github.com/gchilders/msieve_...czos_kernel.cu Other than merging the nonlacuda branch changes back into the lacuda branch and testing the CUDAaware MPI stuff once I have access to a cluster that supports it, I think I'm mostly done. CUDA and CUDA+MPI both work. As a test I used two (now ancient) Tesla K20 gpus to solve a 3.36M x 3.36M matrix in 2h22m. Once I have access to them again, probably by Tuesday, I'll test it out on a couple of V100's. In the CPU code, the explicit unrolling generally needs to be removed. GCC 10+ don't autovectorize the unrolled loops but are good about detecting and vectorizing the rolled versions. This really helps for ARM SVE. Also, on ARM SVE, adding the option msvevectorbits=512 helps significantly. I'm not sure if there's an equivalent option for AVX2 or AVX 512. 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
CRT optimisation?  mickfrancis  Math  9  20160330 10:20 
Fastest you've driven a car?  Oddball  Lounge  43  20110314 00:26 
Hardware Profile  storm5510  Hardware  6  20090819 13:05 
Avatars and profile pictures...  Xyzzy  Lounge  3  20050712 23:12 
Incorrect CPU profile in 22.9 and 22.8  sdbardwick  Software  5  20020922 19:49 