- A+
在上一篇文章里,我们讲解了“Bgr24彩色位图转为Gray8灰度位图”算法。本文将探讨“Bgr24彩色位图转为灰度的Bgr24位图”。区别在于目标位图也是Bgr24格式的,只是将像素数据由彩色转为了灰度。这些算法也是跨平台的,同一份源代码,能在 X86及Arm架构上运行,且均享有SIMD硬件加速。
一、标量算法
1.1 算法实现
算法原理与上一篇文章是一样,唯一区别是目标位图的地址计算与写入处理。因为现在对于每一个像素,需要写入3个字节。
源代码如下。
public static unsafe void ScalarDoBatch(byte* pSrc, int strideSrc, int width, int height, byte* pDst, int strideDst) { const int cbPixel = 3; // Bgr24 const int shiftPoint = 16; const int mulPoint = 1 << shiftPoint; // 0x10000 const int mulRed = (int)(0.299 * mulPoint + 0.5); // 19595 const int mulGreen = (int)(0.587 * mulPoint + 0.5); // 38470 const int mulBlue = mulPoint - mulRed - mulGreen; // 7471 byte* pRow = pSrc; byte* qRow = pDst; for (int i = 0; i < height; i++) { byte* p = pRow; byte* q = qRow; for (int j = 0; j < width; j++) { byte gray = (byte)((p[2] * mulRed + p[1] * mulGreen + p[0] * mulBlue) >> shiftPoint); q[0] = q[1] = q[2] = gray; p += cbPixel; // Bgr24 q += cbPixel; // Bgr24 store grayscale. } pRow += strideSrc; qRow += strideDst; } }
1.2 基准测试代码
使用 BenchmarkDotNet 进行基准测试。
可以使用上一篇文章的公共函数,写好标量算法的基准测试代码。源代码如下。
[Benchmark(Baseline = true)] public void Scalar() { ScalarDo(_sourceBitmapData, _destinationBitmapData, 0); } [Benchmark] public void ScalarParallel() { ScalarDo(_sourceBitmapData, _destinationBitmapData, 1); } public static unsafe void ScalarDo(BitmapData src, BitmapData dst, int parallelFactor = 0) { int width = src.Width; int height = src.Height; int strideSrc = src.Stride; int strideDst = dst.Stride; byte* pSrc = (byte*)src.Scan0.ToPointer(); byte* pDst = (byte*)dst.Scan0.ToPointer(); int processorCount = Environment.ProcessorCount; int batchSize = 0; if (parallelFactor > 1) { batchSize = height / (processorCount * parallelFactor); } else if (parallelFactor == 1) { if (height >= processorCount) batchSize = 1; } bool allowParallel = (batchSize > 0) && (processorCount > 1); if (allowParallel) { int batchCount = (height + batchSize - 1) / batchSize; // ceil((double)length / batchSize) Parallel.For(0, batchCount, i => { int start = batchSize * i; int len = batchSize; if (start + len > height) len = height - start; byte* pSrc2 = pSrc + start * strideSrc; byte* pDst2 = pDst + start * strideDst; ScalarDoBatch(pSrc2, strideSrc, width, len, pDst2, strideDst); }); } else { ScalarDoBatch(pSrc, strideSrc, width, height, pDst, strideDst); } }
二、向量算法
2.1 算法思路
对于24位转8位灰度,可以使用这种办法: 每次从源位图读取3个向量,进行3-元素组的解交织运算,得到 R,G,B 平面数据。随后使用向量化的乘法与加法,来计算灰度值。最后将存储了灰度值的那一个向量,进行3-元素组的交织运算,便能存储到目标位图。
它与“Bgr24彩色位图转为Gray8灰度位图”向量算法的区别,在于最后需做“3-元素组的交织运算”。
例如 Sse指令集使用的是128位向量,此时1个向量为16字节。每次从源位图读取3个向量,就是读取了48字节,即16个RGB像素。最后将灰度向量做“3-元素组的交织运算”,结果是3个向量。将那3个向量存储到目标位图,就是写入了48字节,即16个RGB像素。
对于3-元素组的交织,可以使用 shuffle 类别的指令来实现。例如对于X86架构的 128位向量,可以使用 SSSE3 的 _mm_shuffle_epi8 指令,它对应 NET 中的 Ssse3.Shuffle
方法。源代码如下。
static readonly Vector128<byte> YGroup3Zip_Shuffle_Byte_X_Part0 = Vector128.Create((sbyte)0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5).AsByte(); static readonly Vector128<byte> YGroup3Zip_Shuffle_Byte_X_Part1 = Vector128.Create((sbyte)-1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1).AsByte(); static readonly Vector128<byte> YGroup3Zip_Shuffle_Byte_X_Part2 = Vector128.Create((sbyte)-1, -1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1).AsByte(); static readonly Vector128<byte> YGroup3Zip_Shuffle_Byte_Y_Part0 = Vector128.Create((sbyte)-1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10, -1).AsByte(); static readonly Vector128<byte> YGroup3Zip_Shuffle_Byte_Y_Part1 = Vector128.Create((sbyte)5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10).AsByte(); static readonly Vector128<byte> YGroup3Zip_Shuffle_Byte_Y_Part2 = Vector128.Create((sbyte)-1, 5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1).AsByte(); static readonly Vector128<byte> YGroup3Zip_Shuffle_Byte_Z_Part0 = Vector128.Create((sbyte)-1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1, -1).AsByte(); static readonly Vector128<byte> YGroup3Zip_Shuffle_Byte_Z_Part1 = Vector128.Create((sbyte)-1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1).AsByte(); static readonly Vector128<byte> YGroup3Zip_Shuffle_Byte_Z_Part2 = Vector128.Create((sbyte)10, -1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15).AsByte(); public static Vector128<byte> YGroup3Zip_Shuffle(Vector128<byte> x, Vector128<byte> y, Vector128<byte> z, out Vector128<byte> data1, out Vector128<byte> data2) { var f0A = YGroup3Zip_Shuffle_Byte_X_Part0; var f0B = YGroup3Zip_Shuffle_Byte_X_Part1; var f0C = YGroup3Zip_Shuffle_Byte_X_Part2; var f1A = YGroup3Zip_Shuffle_Byte_Y_Part0; var f1B = YGroup3Zip_Shuffle_Byte_Y_Part1; var f1C = YGroup3Zip_Shuffle_Byte_Y_Part2; var f2A = YGroup3Zip_Shuffle_Byte_Z_Part0; var f2B = YGroup3Zip_Shuffle_Byte_Z_Part1; var f2C = YGroup3Zip_Shuffle_Byte_Z_Part2; var rt0 = Sse2.Or(Sse2.Or(Ssse3.Shuffle(x, f0A), Ssse3.Shuffle(y, f0B)), Ssse3.Shuffle(z, f0C)); var rt1 = Sse2.Or(Sse2.Or(Ssse3.Shuffle(x, f1A), Ssse3.Shuffle(y, f1B)), Ssse3.Shuffle(z, f1C)); var rt2 = Sse2.Or(Sse2.Or(Ssse3.Shuffle(x, f2A), Ssse3.Shuffle(y, f2B)), Ssse3.Shuffle(z, f2C)); data1 = rt1; data2 = rt2; return rt0; }
VectorTraits 库已经集成了上述算法,提供了“Vectors.YGroup3Zip”方法。该方法能够跨平台,它会使用各个平台的shuffle指令。
2.2 算法实现
有了 YGroup3Unzip、YGroup3Zip 方法后,便能方便的编写彩色转灰度的算法了。灰度系数有8位精度,于是需要将 8位数据变宽为16位后,再来计算乘法与加法。最后再将 16位数据,变窄为8位。源代码如下。
public static unsafe void UseVectorsDoBatch(byte* pSrc, int strideSrc, int width, int height, byte* pDst, int strideDst) { const int cbPixel = 3; // Bgr24 const int shiftPoint = 8; const int mulPoint = 1 << shiftPoint; // 0x100 const ushort mulRed = (ushort)(0.299 * mulPoint + 0.5); // 77 const ushort mulGreen = (ushort)(0.587 * mulPoint + 0.5); // 150 const ushort mulBlue = mulPoint - mulRed - mulGreen; // 29 Vector<ushort> vmulRed = new Vector<ushort>(mulRed); Vector<ushort> vmulGreen = new Vector<ushort>(mulGreen); Vector<ushort> vmulBlue = new Vector<ushort>(mulBlue); int vectorWidth = Vector<byte>.Count; int maxX = width - vectorWidth; byte* pRow = pSrc; byte* qRow = pDst; for (int i = 0; i < height; i++) { Vector<byte>* pLast = (Vector<byte>*)(pRow + maxX * cbPixel); // Bgr24 Vector<byte>* qLast = (Vector<byte>*)(qRow + maxX * cbPixel); // Bgr24 store grayscale. Vector<byte>* p = (Vector<byte>*)pRow; Vector<byte>* q = (Vector<byte>*)qRow; for (; ; ) { Vector<byte> r, g, b, gray, gray0, gray1, gray2; Vector<ushort> wr0, wr1, wg0, wg1, wb0, wb1; // Load. b = Vectors.YGroup3Unzip(p[0], p[1], p[2], out g, out r); // widen(r) * mulRed + widen(g) * mulGreen + widen(b) * mulBlue Vector.Widen(r, out wr0, out wr1); Vector.Widen(g, out wg0, out wg1); Vector.Widen(b, out wb0, out wb1); wr0 = Vectors.Multiply(wr0, vmulRed); wr1 = Vectors.Multiply(wr1, vmulRed); wg0 = Vectors.Multiply(wg0, vmulGreen); wg1 = Vectors.Multiply(wg1, vmulGreen); wb0 = Vectors.Multiply(wb0, vmulBlue); wb1 = Vectors.Multiply(wb1, vmulBlue); wr0 = Vector.Add(wr0, wg0); wr1 = Vector.Add(wr1, wg1); wr0 = Vector.Add(wr0, wb0); wr1 = Vector.Add(wr1, wb1); // Shift right and narrow. wr0 = Vectors.ShiftRightLogical_Const(wr0, shiftPoint); wr1 = Vectors.ShiftRightLogical_Const(wr1, shiftPoint); gray = Vector.Narrow(wr0, wr1); // Store. gray0 = Vectors.YGroup3Zip(gray, gray, gray, out gray1, out gray2); q[0] = gray0; q[1] = gray1; q[2] = gray2; // Next. if (p >= pLast) break; p += cbPixel; q += cbPixel; if (p > pLast) p = pLast; // The last block is also use vector. if (q > qLast) q = qLast; } pRow += strideSrc; qRow += strideDst; } }
2.3 基准测试代码
随后为该算法编写基准测试代码。
[Benchmark] public void UseVectors() { UseVectorsDo(_sourceBitmapData, _destinationBitmapData, 0); } [Benchmark] public void UseVectorsParallel() { UseVectorsDo(_sourceBitmapData, _destinationBitmapData, 1); } public static unsafe void UseVectorsDo(BitmapData src, BitmapData dst, int parallelFactor = 0) { int vectorWidth = Vector<byte>.Count; int width = src.Width; int height = src.Height; if (width <= vectorWidth) { ScalarDo(src, dst, parallelFactor); return; } int strideSrc = src.Stride; int strideDst = dst.Stride; byte* pSrc = (byte*)src.Scan0.ToPointer(); byte* pDst = (byte*)dst.Scan0.ToPointer(); int processorCount = Environment.ProcessorCount; int batchSize = 0; if (parallelFactor > 1) { batchSize = height / (processorCount * parallelFactor); } else if (parallelFactor == 1) { if (height >= processorCount) batchSize = 1; } bool allowParallel = (batchSize > 0) && (processorCount > 1); if (allowParallel) { int batchCount = (height + batchSize - 1) / batchSize; // ceil((double)length / batchSize) Parallel.For(0, batchCount, i => { int start = batchSize * i; int len = batchSize; if (start + len > height) len = height - start; byte* pSrc2 = pSrc + start * strideSrc; byte* pDst2 = pDst + start * strideDst; UseVectorsDoBatch(pSrc2, strideSrc, width, len, pDst2, strideDst); }); } else { UseVectorsDoBatch(pSrc, strideSrc, width, height, pDst, strideDst); } }
完整源码在 Bgr24ToGrayBgr24Benchmark.cs
三、基准测试结果
3.1 X86 架构
X86架构下的基准测试结果如下。
BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4460/23H2/2023Update/SunValley3) AMD Ryzen 7 7840H w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores .NET SDK 8.0.403 [Host] : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI DefaultJob : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI | Method | Width | Mean | Error | StdDev | Ratio | |--------------------- |------ |-------------:|-----------:|-----------:|------:| | Scalar | 1024 | 1,128.81 us | 4.436 us | 3.932 us | 1.00 | | ScalarParallel | 1024 | 157.96 us | 1.007 us | 0.942 us | 0.14 | | UseVectors | 1024 | 123.79 us | 1.144 us | 1.014 us | 0.11 | | UseVectorsParallel | 1024 | 26.05 us | 0.503 us | 0.471 us | 0.02 | | | | | | | | | Scalar | 2048 | 4,279.99 us | 37.658 us | 35.226 us | 1.00 | | ScalarParallel | 2048 | 622.01 us | 3.989 us | 3.537 us | 0.15 | | UseVectors | 2048 | 631.53 us | 6.741 us | 6.305 us | 0.15 | | UseVectorsParallel | 2048 | 330.47 us | 5.479 us | 4.857 us | 0.08 | | | | | | | | | Scalar | 4096 | 17,252.90 us | 106.215 us | 99.353 us | 1.00 | | ScalarParallel | 4096 | 3,743.78 us | 25.989 us | 24.310 us | 0.22 | | UseVectors | 4096 | 3,273.92 us | 32.645 us | 30.537 us | 0.19 | | UseVectorsParallel | 4096 | 3,746.83 us | 11.083 us | 9.255 us | 0.22 |
- Scalar: 标量算法。
- ScalarParallel: 并发的标量算法。
- UseVectors: 矢量算法。
- UseVectorsParallel: 并发的矢量算法。
3.2 Arm 架构
同样的源代码可以在 Arm 架构上运行。基准测试结果如下。
BenchmarkDotNet v0.14.0, macOS Sequoia 15.0.1 (24A348) [Darwin 24.0.0] Apple M2, 1 CPU, 8 logical and 8 physical cores .NET SDK 8.0.204 [Host] : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD DefaultJob : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD | Method | Width | Mean | Error | StdDev | Median | Ratio | RatioSD | |--------------------- |------ |-------------:|-----------:|-----------:|-------------:|------:|--------:| | Scalar | 1024 | 719.32 us | 0.215 us | 0.201 us | 719.34 us | 1.00 | 0.00 | | ScalarParallel | 1024 | 157.38 us | 1.423 us | 1.111 us | 157.25 us | 0.22 | 0.00 | | UseVectors | 1024 | 169.25 us | 0.538 us | 0.503 us | 169.40 us | 0.24 | 0.00 | | UseVectorsParallel | 1024 | 57.81 us | 0.998 us | 2.149 us | 58.11 us | 0.08 | 0.00 | | | | | | | | | | | Scalar | 2048 | 2,963.48 us | 6.674 us | 5.211 us | 2,961.39 us | 1.00 | 0.00 | | ScalarParallel | 2048 | 627.47 us | 11.680 us | 25.142 us | 616.63 us | 0.21 | 0.01 | | UseVectors | 2048 | 716.27 us | 2.097 us | 1.961 us | 717.02 us | 0.24 | 0.00 | | UseVectorsParallel | 2048 | 368.49 us | 7.320 us | 21.469 us | 378.95 us | 0.12 | 0.01 | | | | | | | | | | | Scalar | 4096 | 12,449.32 us | 177.868 us | 157.676 us | 12,508.13 us | 1.00 | 0.02 | | ScalarParallel | 4096 | 2,510.22 us | 34.541 us | 30.620 us | 2,501.37 us | 0.20 | 0.00 | | UseVectors | 4096 | 2,968.72 us | 20.503 us | 18.175 us | 2,965.71 us | 0.24 | 0.00 | | UseVectorsParallel | 4096 | 1,728.46 us | 4.362 us | 4.080 us | 1,729.00 us | 0.14 | 0.00 |
四、对算法进行检查
以往想对算法进行检查法时,直接对各个字节做相等比较就行了。
但“Bgr24彩色位图转为灰度的Bgr24位图”不适合那样的验证。由于整数运算有精度损失,造成部分像素值会有一些小的偏差。若直接对各个字节做相等比较,那么结果总是 false.
于是可以编写一个统计误差的函数。可通过误差的大小,来判断算法是否正确,以及比较算法的优劣。
private unsafe long SumDifference(BitmapData expected, BitmapData dst, out long countByteDifference, out int maxDifference) { const int cbPixel = 3; // Bgr24 store grayscale. long totalDifference = 0; countByteDifference = 0; maxDifference = 0; int maxPosX = -1, maxPosY = -1; int width = expected.Width; int height = expected.Height; int strideSrc = expected.Stride; int strideDst = dst.Stride; byte* pRow = (byte*)expected.Scan0.ToPointer(); byte* qRow = (byte*)dst.Scan0.ToPointer(); for (int i = 0; i < height; i++) { byte* p = pRow; byte* q = qRow; for (int j = 0; j < width; j++) { for (int k = 0; k < cbPixel; ++k) { int difference = Math.Abs((int)(*q) - *p); if (0 != difference) { totalDifference += difference; ++countByteDifference; if (maxDifference < difference) { maxDifference = difference; maxPosX = j; maxPosY = i; } } ++p; ++q; } } pRow += strideSrc; qRow += strideDst; } if (maxDifference > 0) { //Console.WriteLine(string.Format("SumDifference maxDifference={0}, at ({1}, {2})", maxDifference, maxPosX, maxPosY)); } return totalDifference; }
在 Setup 方法里增加检查代码。
// Check. bool allowCheck = true; if (allowCheck) { try { TextWriter writer = Console.Out; long totalDifference, countByteDifference; int maxDifference; double averageDifference; long totalByte = Width * Height * 3; double percentDifference; // Baseline ScalarDo(_sourceBitmapData, _expectedBitmapData); // ScalarParallel ScalarParallel(); totalDifference = SumDifference(_expectedBitmapData, _destinationBitmapData, out countByteDifference, out maxDifference); averageDifference = (countByteDifference > 0) ? (double)totalDifference / countByteDifference : 0; percentDifference = 100.0 * countByteDifference / totalByte; writer.WriteLine(string.Format("Difference of ScalarParallel: {0}/{1}={2}, max={3}, percentDifference={4:0.000000}%", totalDifference, countByteDifference, averageDifference, maxDifference, percentDifference)); // UseVectors UseVectors(); totalDifference = SumDifference(_expectedBitmapData, _destinationBitmapData, out countByteDifference, out maxDifference); averageDifference = (countByteDifference > 0) ? (double)totalDifference / countByteDifference : 0; percentDifference = 100.0 * countByteDifference / totalByte; writer.WriteLine(string.Format("Difference of UseVectors: {0}/{1}={2}, max={3}, percentDifference={4:0.000000}%", totalDifference, countByteDifference, averageDifference, maxDifference, percentDifference)); // UseVectorsParallel UseVectorsParallel(); totalDifference = SumDifference(_expectedBitmapData, _destinationBitmapData, out countByteDifference, out maxDifference); averageDifference = (countByteDifference > 0) ? (double)totalDifference / countByteDifference : 0; percentDifference = 100.0 * countByteDifference / totalByte; writer.WriteLine(string.Format("Difference of UseVectorsParallel: {0}/{1}={2}, max={3}, percentDifference={4:0.000000}%", totalDifference, countByteDifference, averageDifference, maxDifference, percentDifference)); } catch (Exception ex) { Debug.WriteLine(ex.ToString()); } }
字段说明:
- totalDifference: 所有像素误差值总和。
- countByteDifference: 发生误差的字节总数。
- averageDifference: 平均每个字节的误差值。越小越好。
- maxDifference: 最大误差值。即输出信息里的“max”。0表示完全匹配,12是正常,34表示误差较大,超过5一般是算法存在问题。
- percentDifference: 发生误差的字节总数,在整个图片中的比例。越小越好。
运行程序,可以看到相关的输出信息。
Difference of ScalarParallel: 0/0=0, max=0, percentDifference=0.000000% Difference of UseVectors: 422400/422400=1, max=1, percentDifference=13.427734% Difference of UseVectorsParallel: 422400/422400=1, max=1, percentDifference=13.427734%
“max”最大为“1”,表示字节的最大误差只有1。整数算法本身是存在舍入误差的,而现在只有1,表示误差已经控制的很好了,算法的质量很高了。
附录
- 完整源代码: https://github.com/zyl910/VectorTraits.Sample.Benchmarks/blob/main/VectorTraits.Sample.Benchmarks.Inc/Image/Bgr24ToGrayBgr24Benchmark.cs
- YGroup3Unzip 的文档: https://zyl910.github.io/VectorTraits_doc/api/Zyl.VectorTraits.Vectors.YGroup3Unzip.html
- YGroup3Zip 的文档: https://zyl910.github.io/VectorTraits_doc/api/Zyl.VectorTraits.Vectors.YGroup3Zip.html
- VectorTraits 的NuGet包: https://www.nuget.org/packages/VectorTraits
- VectorTraits 的在线文档: https://zyl910.github.io/VectorTraits_doc/
- VectorTraits 源代码: https://github.com/zyl910/VectorTraits