LU Decomposition and Eigenvalue Computation for Large Sparse Matrices
Chen Yingshi
2016.1.9 (2013.7.20)
Wide Applications of Sparse Matrix Solving

Matrix solving is at the core of numerical computation [1], and sparse matrix solving is one of its keys: it arises in partial differential equations, integral equations, eigenvalue problems, optimization, and more. Beyond roughly ten thousand unknowns, a dense-matrix treatment is infeasible, and the sparse solve is often the resource bottleneck: time, memory, and out-of-core storage.

Direct methods are based on Gaussian elimination, i.e., computing the LU factorization of A. A usually undergoes a series of permutations, which can be merged into a left permutation matrix P and a right permutation matrix Q. The basic steps are:
1) Symbolic analysis: obtain the permutation matrices P and Q
2) Numerical factorization: PAQ = LU
3) Forward and backward substitution: solve Ly = Pb and Uz = y, then x = Qz

I.S. Duff, A.M. Erisman, and J.K. Reid. Direct Methods for Sparse Matrices. London: Oxford Univ. Press, 1986.
J.J. Dongarra, I.S. Duff, et al. Numerical Linear Algebra for High-Performance Computers.
G.H. Golub, C.F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.
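The three steps above can be sketched in the dense case (real sparse codes add symbolic analysis and compressed storage; this minimal example uses only row pivoting, so Q is the identity): factor PA = LU with partial pivoting, then forward and backward substitution.

```python
def lu_factor(A):
    """Return row permutation p and packed LU factors (L unit lower, U upper)."""
    n = len(A)
    A = [row[:] for row in A]          # work on a copy
    p = list(range(n))                 # row permutation (the matrix P)
    for k in range(n):
        # partial pivoting: pick the largest entry in column k
        piv = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[piv] = A[piv], A[k]
        p[k], p[piv] = p[piv], p[k]
        for i in range(k + 1, n):      # eliminate below the pivot
            A[i][k] /= A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return p, A

def lu_solve(p, LU, b):
    """Solve Ax = b given PA = LU: forward, then backward substitution."""
    n = len(LU)
    y = [b[pi] for pi in p]            # apply P to b
    for i in range(n):                 # forward: Ly = Pb (L has unit diagonal)
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    x = y[:]
    for i in reversed(range(n)):       # backward: Ux = y
        for j in range(i + 1, n):
            x[i] -= LU[i][j] * x[j]
        x[i] /= LU[i][i]
    return x

A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
p, LU = lu_factor(A)
x = lu_solve(p, LU, [4.0, 10.0, 24.0])   # exact solution is [1, 1, 1]
```

The factorization is computed once; each additional right-hand side costs only the two cheap triangular solves, which is why solvers separate the steps.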
Sparse Matrices Are Complex and Varied

Basic parameters: symmetry, sparsity, distribution of nonzeros
Sensitivity: ill-conditioned matrices, condition number
Many storage formats, e.g., the Harwell-Boeing Exchange Format
Test collections: Harwell-Boeing Sparse Matrix Collection, UF Sparse Matrix Collection
The Rapid Progress of Solvers

Example: BBMAT (http://www.cise.ufl.edu/research/sparse/matrices/Simon/bbmat.html), a matrix of order 38,744 whose factors hold over forty million entries:
1988: Cray-2 supercomputer, > 1000 s
2003: 4G, UMFPACK 4, 32.6 s [4]
2006: 3.0 GHz, GSS 1.2, 15 s
2012: 3.0 GHz, 4 cores, GSS 2.3, 4 s
2014: i7, 8 cores, GSS 2.4, 1 s

Driving forces: hardware (CPU, memory, etc.); maturing sparse algorithms (sparse ordering, dense submatrices, multifrontal, supernodal, ...); math libraries (BLAS, LAPACK; www.netlib.org).
Keys to Sparse LU Factorization

Sparse LU factorization = LU factorization that ignores zero entries:
1. Zeros are a dynamic notion; sparse ordering is needed to reduce fill-in.
2. Dense submatrices.

By the symbolic analysis and numerical factorization employed, direct methods fall into several classes [15]:
1) General technique: LU factorization driven by structures such as the elimination tree; suited to very sparse, unstructured matrices.
2) Frontal scheme: during LU factorization, several consecutive rows are merged into one dense block (the front), which is factorized with high-performance libraries such as BLAS.
3) Multifrontal method: rows (columns) with identical symbolic structure are merged into multiple dense blocks, and factorization proceeds in units of these blocks; their generation, elimination, assembly, and release require dedicated data structures and algorithms.
Sparse Ordering

Ordering reduces the fill-in produced during LU factorization. Finding the ordering with minimum fill-in is NP-complete (it cannot be solved in reasonable time), and for a concrete matrix there is currently no method or metric to predict which algorithm will do best. The reliable approach is therefore to run several orderings and compare the fill-in they produce. The two main families are minimum degree ordering (e.g., MMD, the multiple minimum degree algorithm) and nested dissection; mature ordering libraries include AMD [12], COLAMD, and METIS [13].

Yannakakis, M. Computing the minimum fill-in is NP-complete. SIAM J. Algebraic Discrete Methods, 1981, 2: 77-79.
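To make fill-in concrete, here is a tiny graph-elimination sketch (assumed symmetric pattern, made up for illustration): eliminating a vertex connects all its remaining neighbors pairwise, and each new edge is a fill-in entry. On an arrow/star pattern, eliminating the hub first is catastrophic, while the minimum-degree order produces no fill at all.

```python
def count_fill(adj, order):
    """Eliminate vertices in 'order'; return the number of fill-in edges."""
    adj = {v: set(nb) for v, nb in adj.items()}   # copy adjacency sets
    fill = 0
    for v in order:
        nbrs = adj.pop(v)
        for u in adj:
            adj[u].discard(v)
        nbrs = [u for u in nbrs if u in adj]      # still-uneliminated neighbors
        for i, a in enumerate(nbrs):              # connect neighbors pairwise
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
    return fill

# Star pattern: vertex 0 is connected to every other vertex.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
bad  = count_fill(star, [0, 1, 2, 3, 4])   # hub first: 6 fill-in edges
good = count_fill(star, [4, 3, 2, 1, 0])   # leaves first (min degree): 0
```

This is exactly the effect the permutations P and Q are chosen to exploit.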
Approximate Minimum Degree Ordering - Quotient Graphs

The approximate minimum degree ordering algorithm (AMD) was introduced around 1996 by Patrick R. Amestoy and coauthors.

Amestoy, P. R., Davis, ... 1996a. An approximate minimum degree ordering algorithm. SIAM J. Matrix Anal. Applic. 17, 4, 886-905.
Why Dense Submatrices (Dense Matrix)?

http://eigen.tuxfamily.org/index.php?title=Benchmark
The Multifrontal Method: An Overview

History: introduced by Duff and Reid [2]; analyzed and improved by J.W.H. Liu and others [3]; T.A. Davis developed UMFPACK [4].

Basic algorithm: exploit the sparsity of the matrix to obtain a sequence of dense submatrices (fronts), turning LU factorization into assembly, elimination, and update operations on these fronts.

Advantages of the multifrontal method: fronts are dense matrices, so high-performance libraries (BLAS, etc.) can be called directly; dense submatrices save index storage; parallelism improves.

Major solvers today: UMFPACK, GSS, MUMPS, PARDISO, WSMP, HSL MA41, etc.
How LU Factorization Forms Fronts

A matrix of order 10. Blue dots are nonzeros; red dots are fill-in produced by the factorization. Front partition: {a}, {b}, {c}, {d}, {e}, {f,g}, {h,i,j}.
Assembly, Elimination, and Update of the Fronts

[Figure: the elimination tree over the fronts {a}, {b}, {c}, {d}, {e}, {f,g}, {h,i,j}; each node lists the rows/columns assembled and eliminated at that front.]
The Elimination Tree

The elimination tree is the key structure of symbolic analysis. Each node corresponds to a column (row) of the matrix and is connected only to its parent, defined (in terms of the factor L) as

parent(j) = min{ i > j : l(i,j) != 0 }

J.W.H. Liu. The Role of Elimination Trees in Sparse Factorization. SIAM J. Matrix Anal. Applic., 11(1): 134-172, 1990.
J.W.H. Liu. Elimination structures for unsymmetric sparse LU factors. SIAM J. Matrix Anal. Appl., 14(2): 334-352, 1993.
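The elimination tree can be built directly from the pattern of A, without forming L, by Liu's path-compression algorithm; a compact sketch on a small assumed 5x5 symmetric pattern:

```python
def etree(pattern, n):
    """Elimination tree of a symmetric sparse pattern (Liu's algorithm):
    parent[j] = min{ i > j : L[i][j] != 0 }, computed without forming L."""
    parent = [-1] * n
    ancestor = [-1] * n                 # path-compressed ancestor pointers
    for j in range(n):
        for i in pattern[j]:            # nonzeros A[j][i] with i < j
            while i != -1 and i < j:    # climb from i to the current root
                nxt = ancestor[i]
                ancestor[i] = j         # path compression
                if nxt == -1:
                    parent[i] = j       # i's subtree now hangs under j
                i = nxt
    return parent

# Lower-triangular pattern of an assumed 5x5 symmetric matrix:
# pattern[j] lists the columns i < j with A[j][i] != 0.
pattern = {0: [], 1: [0], 2: [], 3: [2], 4: [1, 3]}
parent = etree(pattern, 5)   # -> [1, 4, 3, 4, -1]: node 4 is the root
```

Disjoint subtrees of the result can be factorized independently, which is the basis for the parallel task partition mentioned later.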
About GSS

Users in universities and research institutes: China Electric Power Research Institute, the University of Hong Kong, Nanjing University, Hohai University, China University of Petroleum, University of Electronic Science and Technology of China, China Three Gorges University, Shandong University.

Written in standard C, portable across platforms; faster and more stable than Intel PARDISO; CPU/GPU hybrid computing; overcomes the 32-bit Windows memory limit; 32 user parameters; supports user-customized modules.
Why They Choose GSS

- Crosslight (industry leader in TCAD simulation): the hybrid GPU/CPU version is more than 2x faster than PARDISO, MUMPS, and other sparse solvers.
- SoilVision (the most technically advanced suite of 1D/2D/3D geotechnical software): much faster than their own sparse solver.
- FEM Consulting (among the leading research teams in the finite element method since 1967): GSS is faster than PARDISO and provides many custom modules.
- GSCAD (leading building software in China): GSS provides a user-specific module to deal with ill-conditioned matrices.
- ICAROS (a global turnkey geospatial mapping service provider and developer of state-of-the-art photogrammetric technologies): GSS is faster than PARDISO, and technical help is provided.
- EPRI (China Electric Power Research Institute): 3-4x faster than KLU.
GSS - The Weighted Elimination Tree

The workload-weighted elimination tree: based on the elimination tree structure, GSS estimates the numerical-factorization workload, partitions the LU factorization into many independent tasks, and thereby lays the foundation for efficient parallel computation.
GSS - Dual-Threshold Column Pivoting
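The slide does not spell out GSS's dual-threshold rule, so the following is only a generic threshold-pivoting sketch with invented data: instead of always taking the largest entry (strict partial pivoting), any candidate within a factor tau of the largest is acceptable, which lets a sparsity-preserving row be chosen as pivot.

```python
def choose_pivot(col_vals, row_nnz, tau=0.1):
    """Threshold pivoting sketch. col_vals: {row: value} candidates in the
    pivot column; row_nnz: {row: nonzero count}. Among rows whose magnitude
    is at least tau * max, pick the sparsest row (a Markowitz-style tie-break
    between numerical stability and fill-in)."""
    vmax = max(abs(v) for v in col_vals.values())
    ok = [r for r, v in col_vals.items() if abs(v) >= tau * vmax]
    return min(ok, key=lambda r: row_nnz[r])

# Row 2 holds the largest value but has many nonzeros; row 0 is numerically
# acceptable (|0.5| >= 0.1 * |4.0|) and much sparser, so it wins.
pivot = choose_pivot({0: 0.5, 1: 0.02, 2: 4.0}, {0: 2, 1: 3, 2: 9})
```

Lowering tau trades stability for sparsity; a second threshold (hence "dual") would typically bound how far this relaxation may go.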
GSS - CPU/GPU Hybrid Computing

1) After dividing the LU factorization into many parallel tasks, GSS uses an adaptive strategy to run them on different hardware (CPU, GPU, ...): if the GPU has more computing power, it automatically receives more tasks; if the CPU is more powerful, GSS gives it more tasks.
2) Furthermore, if the CPU is idle (has finished all its tasks) while the GPU is still running one, GSS splits that task into smaller child tasks and assigns some of them to the CPU, so the CPU resumes computing and parallel efficiency rises.
3) GSS also runs calibration tests to find the best parameters for each hardware configuration.
GSS - Solving Matrices from a Frequency-Domain Spectral Element Method

The matrix is fairly dense: about 400,000 unknowns and 1.5 billion nonzeros; GSS needs about 15 GB of memory. One LU factorization: symbolic analysis 50 s, numerical factorization 46 s, substitution 2.5 s.

Multiple right-hand sides must be solved: 32 right-hand sides in 80 s? Estimated 80/4 = 20 s.

Further optimization: CPU/GPU hybrid computing (numerical factorization about 35 s); reuse of the symbolic analysis; exploiting the special structure of the matrix to further reduce the nonzeros.
Benchmark Comparison

The test matrices are all from the UF Sparse Matrix Collection. PARDISO is from Intel Composer XE 2013 SP1; GSS 2.4 uses CPU-GPU hybrid computing. The test machine has an Intel Core i7-4770 (3.4 GHz) with 24 GB memory and an ASUS GTX 780 graphics card (compute capability 3.5); NVIDIA CUDA Toolkit 5.5; Windows 7 64-bit. Both solvers use default parameters.

For large matrices that take a long time to factorize, GSS 2.4 is nearly 3x faster than PARDISO. For matrices with short solve times, PARDISO is faster than GSS; one reason is that the complex synchronization between CPU and GPU adds overhead.
Eigenvalue Computation for Large Sparse Matrices

"150 Years Old and Still Alive: Eigenproblems" - Henk A. van der Vorst, Gene H. Golub, 1997.

The eigenproblem: Ax = lambda * x, where lambda is an eigenvalue and x a corresponding eigenvector.

Sparse LU factorization is, in theory, just Gaussian elimination; sparse eigenvalue computation is much closer to pure mathematics.

J.H. Wilkinson. The Algebraic Eigenvalue Problem.
G.W. Stewart. Matrix Algorithms, Volume II: Eigensystems.
G.H. Golub, C.F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.
Z. Bai et al. Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. 2000.
Power Iteration

(Source: Math 685/CSI 700, Spring 08, George Mason University, Department of Mathematical Sciences)
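A minimal power-iteration sketch on a small assumed matrix: repeatedly apply A and normalize; the iterate converges to the dominant eigenvector, and its Rayleigh quotient to the dominant eigenvalue (3 for this matrix).

```python
def power_iteration(A, x, steps=100):
    """Power iteration: x_{k+1} = A x_k / ||A x_k||; returns the Rayleigh
    quotient of the final iterate and the iterate itself."""
    for _ in range(steps):
        y = [sum(a * xi for a, xi in zip(row, x)) for row in A]  # y = A x
        norm = max(abs(v) for v in y)                            # normalize
        x = [v / norm for v in y]
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    lam = sum(a * b for a, b in zip(Ax, x)) / sum(b * b for b in x)
    return lam, x

A = [[2.0, 1.0], [1.0, 2.0]]     # eigenvalues 3 and 1
lam, x = power_iteration(A, [1.0, 0.0])   # lam -> 3
```

Convergence is linear with rate |lambda2/lambda1| (here 1/3 per step), which is why the refinements on the following slides exist.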
Power Iteration - An Intuitive Picture
Rayleigh Quotient Iteration
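A sketch of Rayleigh quotient iteration on an assumed symmetric 2x2 matrix: each step uses the current Rayleigh quotient rho as a shift, solves (A - rho*I) y = x (here by Cramer's rule), and normalizes. Near an eigenpair the convergence is cubic.

```python
def rqi(A, x, steps=6):
    """Rayleigh quotient iteration on a 2x2 symmetric matrix."""
    rho = 0.0
    for _ in range(steps):
        Ax = [A[0][0] * x[0] + A[0][1] * x[1],
              A[1][0] * x[0] + A[1][1] * x[1]]
        rho = (Ax[0] * x[0] + Ax[1] * x[1]) / (x[0] ** 2 + x[1] ** 2)
        M = [[A[0][0] - rho, A[0][1]], [A[1][0], A[1][1] - rho]]
        det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
        if abs(det) < 1e-12:           # rho is (numerically) an eigenvalue
            break
        y = [(x[0] * M[1][1] - x[1] * M[0][1]) / det,   # Cramer's rule
             (M[0][0] * x[1] - M[1][0] * x[0]) / det]
        norm = (y[0] ** 2 + y[1] ** 2) ** 0.5
        x = [y[0] / norm, y[1] / norm]
    return rho

A = [[2.0, 1.0], [1.0, 2.0]]          # eigenvalues 3 and 1
rho = rqi(A, [1.0, 0.3])              # converges cubically toward 3
```

Note the price: each step needs a linear solve with a different shift, i.e., a new factorization, which is exactly where a fast sparse LU pays off.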
Subspace Iteration
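A minimal subspace (orthogonal) iteration sketch on an assumed diagonal test matrix: iterate a block of vectors, re-orthonormalizing with Gram-Schmidt each step; the block converges to the invariant subspace of the largest eigenvalues, and the Rayleigh-Ritz projection recovers them.

```python
def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def gram_schmidt(V):
    """Orthonormalize the vectors in V (classical Gram-Schmidt)."""
    Q = []
    for v in V:
        w = v[:]
        for q in Q:
            c = sum(a * b for a, b in zip(q, v))
            w = [wi - c * qi for wi, qi in zip(w, q)]
        n = sum(x * x for x in w) ** 0.5
        Q.append([x / n for x in w])
    return Q

def subspace_iteration(A, V, steps=60):
    for _ in range(steps):
        V = gram_schmidt([matvec(A, v) for v in V])
    # Rayleigh-Ritz projection H = V^T A V; here it is nearly diagonal,
    # so its diagonal approximates the dominant eigenvalues.
    return [[sum(vi * wi for vi, wi in zip(v, matvec(A, w))) for w in V]
            for v in V]

A = [[5.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 1.0]]
H = subspace_iteration(A, [[1.0, 1.0, 1.0], [1.0, -1.0, 0.5]])
# H[0][0] -> 5 and H[1][1] -> 3, the two largest eigenvalues of A
```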
Krylov Subspaces
Arnoldi Iteration
The Basic Arnoldi Algorithm

ARPACK: www.caam.rice.edu/software/ARPACK/

Morgan, R.B. On restarting the Arnoldi method for large nonsymmetric eigenvalue problems.
Lehoucq, R.B. Analysis and Implementation of an Implicitly Restarted Arnoldi Iteration. PhD thesis.
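The basic algorithm can be sketched in a few lines: build an orthonormal Krylov basis Q and the small upper-Hessenberg matrix H satisfying A Q_k = Q_{k+1} H_k; the eigenvalues of H (Ritz values) approximate extreme eigenvalues of A. The example matrix is assumed for illustration.

```python
def arnoldi(matvec, b, k):
    """Arnoldi iteration: returns the orthonormal basis Q (list of vectors)
    and the (k+1) x k upper-Hessenberg matrix H."""
    nb = sum(x * x for x in b) ** 0.5
    Q = [[x / nb for x in b]]                     # q1 = b / ||b||
    H = [[0.0] * k for _ in range(k + 1)]
    for j in range(k):
        w = matvec(Q[j])
        for i in range(j + 1):                    # orthogonalize w against Q
            H[i][j] = sum(q * v for q, v in zip(Q[i], w))
            w = [wi - H[i][j] * qi for wi, qi in zip(w, Q[i])]
        H[j + 1][j] = sum(x * x for x in w) ** 0.5
        if H[j + 1][j] < 1e-12:                   # breakdown: invariant subspace
            break
        Q.append([x / H[j + 1][j] for x in w])
    return Q, H

A = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
mv = lambda x: [sum(a * v for a, v in zip(row, x)) for row in A]
Q, H = arnoldi(mv, [1.0, 0.0, 0.0], 3)
# with k = n the leading 3x3 block of H is unitarily similar to A,
# so their traces agree (2 + 3 + 4 = 9)
```

Restarting (ARPACK's implicit restart, or the Krylov-Schur form on the later slides) exists precisely because Q grows with every step and must be compressed.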
Computing Eigenvalues with the Arnoldi Iteration
Krylov Decompositions - Based on Unitary Similarity Transformations

The standard Arnoldi decomposition vs. the general Krylov decomposition (G.W. Stewart, Matrix Algorithms, Volume II: Eigensystems, p. 309), where Q is unitary and such transformations can be chained.

Decompositions based on unitary similarity transformations are backward stable. (Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide, p. 173)
Advantages of the Krylov-Schur Decomposition

Easy selection of Ritz values as implicit shifts: the Schur decomposition reduces any matrix to upper triangular form whose diagonal holds its eigenvalues, and the eigenvalues can be reordered arbitrarily along that diagonal by unitary similarity transformations. Thus, after forming the Rayleigh matrix and computing all Ritz values, the wanted Ritz values can be moved to the front and the unwanted ones to the back; after the restart, only the selected Ritz values remain in the sequence.

Easy deflation (locking and purging): the idea is similar, though more involved.

Faster in practice: in tests, both the iteration count and the running time drop by about one third compared with ARPACK.

G.W. Stewart. A Krylov-Schur algorithm for large eigenproblems. SIAM J. Matrix Anal. Appl., 23 (2001), pp. 601-614.
Krylov-Schur and Restarting

After a restart the B matrix becomes more complex, as shown in the figure on the right. For the restarted Krylov-Schur algorithm, see G.W. Stewart, Matrix Algorithms, Volume II: Eigensystems, pp. 329-330.
Faster Convergence

Across many practical examples it is clearly faster than the widely used ARPACK, typically needing only 60%-70% of ARPACK's iterations. Why?
Subspace Iteration with Approximate Spectral Projection

A recent algorithm: subspace iteration accelerated by an approximate spectral projection based on a Cauchy contour integral, which can find all eigenvalues inside a specified region of the complex plane. It mainly targets symmetric matrices and still needs to be extended to the nonsymmetric case.

P.T.P. Tang, E. Polizzi. FEAST as a subspace iteration eigensolver accelerated by approximate spectral projection.
The Shift-Invert Transformation

The standard shift-invert transformation replaces A by (A - sigma*I)^(-1), mapping an eigenvalue lambda of A to 1/(lambda - sigma), so the eigenvalues nearest the shift sigma become the dominant ones.

K. Meerbergen, D. Roose. Matrix transformations for computing rightmost eigenvalues of large sparse non-symmetric eigenvalue problems.
The Cayley Transformation

Properties of the Cayley transformation:
1. A line parallel to the imaginary axis is mapped onto the unit circle.
2. Points to the left of that line are mapped inside the unit circle.
3. Points to the right of that line are mapped outside the unit circle.

K. Meerbergen, D. Roose. Matrix transformations for computing rightmost eigenvalues of large sparse non-symmetric eigenvalue problems.
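The three properties can be checked numerically on scalars, using the eigenvalue map theta = (lambda - mu)/(lambda - sigma) of the Cayley transform (A - sigma*I)^(-1)(A - mu*I); the choice sigma = 1, mu = -1 below is an assumed example, for which the mapped line is Re(lambda) = 0.

```python
def cayley(lam, sigma=1.0, mu=-1.0):
    """Scalar Cayley map: the image of eigenvalue lam under
    (A - sigma*I)^(-1) (A - mu*I)."""
    return (lam - mu) / (lam - sigma)

on_line = abs(cayley(complex(0.0, 5.0)))    # Re = 0  -> |theta| = 1
left    = abs(cayley(complex(-2.0, 1.0)))   # Re < 0  -> |theta| < 1
right   = abs(cayley(complex(2.0, 3.0)))    # Re > 0  -> |theta| > 1
```

This is why the transform suits rightmost-eigenvalue problems: everything right of the chosen line lands outside the unit circle, where power-type iterations find it first.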
Filter Polynomials

G.W. Stewart. Matrix Algorithms, Volume II: Eigensystems, p. 317.
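As a scalar illustration of polynomial filtering (a generic Chebyshev example, not the specific filter discussed by Stewart): applying p(A) to a vector multiplies each eigencomponent by p(lambda), so a polynomial that stays bounded on the unwanted part of the spectrum while growing rapidly outside it enhances the wanted components.

```python
def cheb(k, x):
    """Chebyshev polynomial T_k(x) via the three-term recurrence."""
    t0, t1 = 1.0, x
    for _ in range(k - 1):
        t0, t1 = t1, 2.0 * x * t1 - t0
    return t1

# |T_10| stays <= 1 on the damped interval [-1, 1] but is huge at 1.5,
# so an eigencomponent at 1.5 is amplified thousands of times relative
# to components inside the interval.
inside  = abs(cheb(10, 0.3))
outside = abs(cheb(10, 1.5))
```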
Appendix 1

Aleksei Nikolaevich Krylov (1863-1945) showed in 1931 how to use sequences of the form {b, Ab, A2b, . . .} to construct the characteristic polynomial of a matrix. Krylov was a Russian applied mathematician whose scientific interests arose from his early training in naval science that involved the theories of buoyancy, stability, rolling and pitching, vibrations, and compass theories. Krylov served as the director of the Physics-Mathematics Institute of the Soviet Academy of Sciences from 1927 until 1932, and in 1943 he was awarded a "state prize" for his work on compass theory. Krylov was made a "hero of socialist labor," and he is one of a few mathematicians to have a lunar feature named in his honor: on the moon there is the "Crater Krylov."

Walter Edwin Arnoldi (1917-1995) was an American engineer who published this technique in 1951, not far from the time that Lanczos's algorithm emerged. Arnoldi received his undergraduate degree in mechanical engineering from Stevens Institute of Technology, Hoboken, New Jersey, in 1937 and his MS degree at Harvard University in 1939. He spent his career working as an engineer in the Hamilton Standard Division of the United Aircraft Corporation, where he eventually became the division's chief researcher. He retired in 1977. While his research concerned mechanical and aerodynamic properties of aircraft and aerospace structures, Arnoldi's name is kept alive by his orthogonalization procedure.
Appendix 2: References

[1] R. Kress. Numerical Analysis. Springer-Verlag, 1991.
[2] I.S. Duff, A.M. Erisman, J.K. Reid. Direct Methods for Sparse Matrices. London: Oxford Univ. Press, 1986.
[3] J.W.H. Liu. The Multifrontal Method for Sparse Matrix Solution: Theory and Practice. SIAM Rev., 34 (1992), pp. 82-109.
[4] T.A. Davis. A column pre-ordering strategy for the unsymmetric-pattern multifrontal method. ACM Trans. Math. Software, 30(2), pp. 165-195, 2004.
[5] N.J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2002.
[6] G.H. Golub, C.F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.
[7] J.W.H. Liu. The Multifrontal Method for Sparse Matrix Solution: Theory and Practice. SIAM Rev., 34 (1992), pp. 82-109.
[8] G.M. Del Corso, A. Gullí, F. Romani. Fast PageRank Computation via a Sparse Linear System (Extended Abstract).
[9] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS, Boston, 1996.
[10] Y.S. Chen, Simon Li. Application of Multifrontal Method for Doubly-Bordered Sparse Matrix in Laser Diode Simulator. NUSOD, 2004.
[11] Y.S. Chen, W.Y. Wu, et al. Solving large structural equation systems with the multifrontal method. Building Structure, 2007(09). (in Chinese)
[12] X.L. Song, Y.S. Chen, et al. A block algorithm for solving large sparse linear systems in full-process dynamic simulation of power systems. (in Chinese)
Thank you very much!

MAIL: gsp@grusoft.com
QQ: 304718494
www.grusoft.com