Speaker：Yeong-Luh Ueng 2018/4/17

A SHUFFLE-BASED ITERATIVE DEMODULATION AND DECODING SCHEME FOR LDPC CODED FLASH MEMORY
Li-Chung Lee, Wei-Min Lai, Mao-Ruei Li, Yeong-Luh Ueng
Dept. Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan

Outline

NAND Flash
Preliminary
  LDPC coded modulation using a very sparse LDPC code
  Layer-based IDD (iterative demodulation and decoding) receiver
Proposed Shuffle-based IDD Receiver
  Design challenge
  Hardware-friendly structure interleaver
  Optimized memory bank interface
Simulation Results
Conclusion

This slide shows the outline of the talk, which has five parts. First, we introduce the structure and characteristics of NAND Flash. Then we review the IDD scheme for LDPC coded modulation and the previous work on the layer-based IDD receiver. Next, to improve hardware efficiency, we propose replacing the layer-based IDD receiver with a shuffle-based IDD receiver, and we resolve and optimize the problems that this replacement raises. Finally, we show the simulation results and conclude the talk.


Introduction to TLC Flash
Information is stored in floating-gate transistors.
Single-Level Cell (SLC): 1 bit (2 states)
Multi-Level Cell (MLC): 2 bits (4 states)
Triple-Level Cell (TLC): 3 bits (8 states)
About TLC
  Higher density
  Degraded reliability and performance

Flash memory is a kind of non-volatile storage. Information is stored using floating-gate transistors. It has many advantages, such as small physical size, low power consumption, and high storage density, so it has become more and more popular in recent years. A single-level cell, a multi-level cell, and a triple-level cell store 1 bit, 2 bits, and 3 bits, respectively.

Flash Model

Using threshold voltages (VRef) to read a Flash cell
More threshold voltages for TLC
TLC memory cell: reference voltages VRef0 to VRef6 separate the eight states 111, 110, 100, 101, 001, 011, 010, 000

A threshold voltage is used to determine which value is stored in the Flash cell. For example, with a single threshold, a cell whose sensed voltage is greater than the threshold is read as zero. More than one threshold level must be used to sense the data bits stored in a TLC.
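A hard-decision read of a TLC cell can be sketched as a lookup against the seven reference voltages. This is a minimal illustration, not the sensing circuit of any real device: the voltage values in `V_REF` are hypothetical, and the sketch assumes that the state labels on the slide are listed from the lowest threshold-voltage window to the highest. The helper `is_gray` checks that neighbouring states differ in exactly one bit.

```python
from bisect import bisect_right

# State labels from the slide, assumed to be ordered from lowest to highest voltage.
TLC_STATES = ["111", "110", "100", "101", "001", "011", "010", "000"]

# Hypothetical reference voltages VRef0..VRef6 in volts (illustrative values only).
V_REF = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]


def hard_read(sensed_voltage):
    """Return the 3-bit label of the voltage window that contains the input."""
    # Number of reference voltages at or below the sensed voltage.
    level = bisect_right(V_REF, sensed_voltage)
    return TLC_STATES[level]


def is_gray(states):
    """True if every pair of neighbouring states differs in exactly one bit."""
    return all(
        sum(a != b for a, b in zip(s, t)) == 1
        for s, t in zip(states, states[1:])
    )


print(hard_read(0.2), hard_read(1.7), hard_read(3.9))
print(is_gray(TLC_STATES))
```

Seven comparisons are needed to resolve eight states, which is why a TLC read involves more threshold voltages than an SLC read.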

TLC Flash Model

Hard-decision / soft-decision memory sensing
  Using more than one threshold voltage
  Increased read latency
Figure: SLC model under hard-decision sensing and under soft-decision sensing
Modeled as PAM modulation using Gray mapping
  Modeled as 2-/4-/8-PAM modulation for SLC/MLC/TLC
Figure: TLC model with eight Gray-labelled states

There are two methods to sense data: hard-decision sensing and soft-decision sensing. Conventionally, hard-decision sensing is used because its read latency is short. When the data reliability decreases, however, the control system switches to soft-decision sensing to prolong the Flash lifetime. SLC, MLC, and TLC can be modeled as 2-PAM, 4-PAM, and 8-PAM modulation schemes, respectively. Conventionally, Gray mapping is applied to MLC and TLC, since neighboring levels then differ in only one bit. The storage density of TLC is three times that of SLC, but its data reliability is much lower. More powerful error-correction codes, such as LDPC codes, are therefore necessary for TLC Flash.


Preliminary: LDPC Coded Modulation
LDPC coded modulation schemes in [5][6]
  Non-Gray mapping
  Very sparse parity-check matrix
Advantages
  Reduce complexity
  Improve decoding throughput

Column-degree (d_v) distribution, as listed on the slide:
  d_v: 2, 3, 4, 5-9
  Gray mapping: 0.04, 0.2, 0.44, 0.32
  Non-Gray [5][6]: 0.9, 0.1

Figure: conventional matrix (top) and the matrix of [6] (bottom)

Conventionally, Gray mapping is applied to TLC. The authors of [5][6] proposed an LDPC coded modulation scheme based on a non-Gray mapping, and the resulting parity-check matrix is very sparse. In the figure, the white blocks are all-zero sub-matrices, and the matrix of [6] contains more of them than the conventional matrix above it. This means that the complexity of the LDPC decoder can be decreased and the decoding throughput can be increased significantly.

[5] J.-H. Shy, “LDPC coded modulation and its applications to MLC flash memory,” NTHU Thesis, 2014.
[6] H.-C. Lee, J.-H. Shy, Y.-M. Chen, and Y.-L. Ueng, “LDPC coded modulation for TLC flash memory,” IEEE Information Theory Workshop (ITW), Nov. 2017.

Preliminary: LDPC Coded Modulation

Iterative demodulation and decoding (IDD) can enhance the error-rate performance [4].
  Complex interface between demodulator and decoder
  Lower throughput

This figure shows an LDPC coded 8-PAM scheme together with the iterative demodulation and decoding (IDD) receiver. Data is encoded by the LDPC encoder, the coded bits are permuted by an interleaver, and the modulator maps every three bits to one symbol, which is stored in the Flash. When a symbol is read from the Flash, the demodulator produces the channel LLR values Lc, which pass through the de-interleaver to the LDPC decoder. The decoder uses the channel LLRs to produce the extrinsic messages Le, which are fed back to the demodulator to assist decoding. A receiver in which soft information, in the form of log-likelihood ratio (LLR) messages, passes back and forth between the demodulator and the decoder in this way is called an IDD receiver. In a non-IDD receiver, the decoder does not feed messages back to the demodulator. The error-rate performance of the IDD receiver is therefore expected to be better than that of the conventional non-IDD receiver. However, the IDD receiver has a high hardware complexity and area cost, and hence it is rarely adopted in practical systems.

[4] F. Schreckenbach, et al., “Optimization of symbol mappings for bit-interleaved coded modulation with iterative decoding,” IEEE Commun. Lett., pp. 593–595, 2003.
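The demodulator side of the loop can be sketched as a soft demapper that combines the read-out value with the decoder's feedback. The talk does not specify the demodulation algorithm, so everything below is an assumption made for illustration: an additive Gaussian noise model, normalised amplitudes in `LEVELS`, the max-log approximation, the Gray labels of the Flash model rather than the non-Gray mapping of [5][6], and the convention that a positive LLR favours bit 0.

```python
# Labels of the eight PAM levels, lowest level first (Gray labels, for illustration).
LABELS = ["111", "110", "100", "101", "001", "011", "010", "000"]
LEVELS = [-7, -5, -3, -1, 1, 3, 5, 7]  # assumed normalised 8-PAM amplitudes


def demod_extrinsic(y, sigma2, la):
    """Max-log extrinsic LLRs for the three bits of one symbol.

    y      : value read from the cell
    sigma2 : noise variance (Gaussian noise is an assumption)
    la     : a priori LLRs fed back by the decoder, LLR = log P(0) / P(1)
    """
    metrics = []
    for level, label in zip(LEVELS, LABELS):
        bits = [int(c) for c in label]
        # Log-prior of this label given the decoder feedback (up to a constant).
        prior = sum((1 - 2 * b) * l / 2.0 for b, l in zip(bits, la))
        metrics.append((-(y - level) ** 2 / (2.0 * sigma2) + prior, bits))

    out = []
    for i in range(3):
        best0 = max(m for m, bits in metrics if bits[i] == 0)
        best1 = max(m for m, bits in metrics if bits[i] == 1)
        # Subtract the bit's own prior so that only new information is passed on.
        out.append(best0 - best1 - la[i])
    return out


# First pass: no feedback yet, so all a priori LLRs are zero.
print(demod_extrinsic(-7.0, 1.0, [0.0, 0.0, 0.0]))
# Later pass: feedback on bits 1 and 2 changes the LLR produced for bit 0.
print(demod_extrinsic(0.0, 1.0, [0.0, 4.0, -4.0]))
```

With zero feedback the function behaves like a non-IDD demapper. Once the decoder supplies `la`, the LLR of each bit is refined by what is known about the other two bits of the same symbol, which is the source of the IDD gain.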

Preliminary: Layer-based IDD Receiver [7]
The IDD receiver proposed in [7]
  Two-codeword schedule
  The Lc and Le memories are doubled

Layered decoding is commonly used for LDPC codes. The IDD receiver in [7] uses a layer-scheduled LDPC decoder, which processes the parity-check matrix row by row, as shown in the figure, and can compute the extrinsic messages and return them to the demodulator only after all rows have been decoded. Because of this data dependency, the demodulator is idle until the layered LDPC decoder finishes the row decoding process, and the decoder is idle while the demodulator works. To improve hardware utilization, the authors of [7] proposed a two-codeword schedule, in which the decoder and the demodulator process two different codewords at the same time. However, this architecture doubles the Lc and Le memory in order to store the information for both codewords. In this paper, we use a shuffle-based architecture to simplify the data dependency. Since shuffle-based decoding proceeds block column by block column, the demodulator can begin the demodulation process as soon as the decoder finishes the computation for a single block column, without waiting for the remaining block columns.

[7] M. R. Li, T. Y. Kuan, H. C. Lee and Y. L. Ueng, “An IDD receiver of LDPC coded modulation scheme for flash memory applications,” 2016 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), Jeju, 2016, pp

Proposed Shuffle-Based IDD Scheme
Advantages of the shuffle-based scheme
  One-codeword decoding
  Reduction of the memory requirement
Challenges
  Interface design
  Hardware idle

To reduce the large memory required by the two-codeword technique in [7], we propose a shuffle-based IDD scheme in which the layer-based decoder is replaced with a shuffle-based LDPC decoder, so that it is not necessary to process two codewords at the same time. A direct replacement raises some challenges, however. The first is to design an efficient interface between the decoder and the demodulator. The second is to avoid hardware idle time, which would result in a lower decoding throughput.

Proposed Shuffle-Based IDD Receiver
Figure callouts: challenge in LDPC decoder; challenge in memory interface

This figure shows the proposed shuffle-based IDD receiver, in which the LDPC decoder is replaced with a shuffle-scheduled decoder. In the LDPC decoder, the C2V messages of the previous iteration are first recovered from the data stored in the Min register and the V2C-sign memory. The recovered C2V messages then follow two paths: they are added to the channel LLR values to calculate the APP values, and they are stored in the FIFO register to wait for the V2C computation. After the APP calculation finishes, the V2C calculator computes the V2C messages and converts them to sign-magnitude form. The signs update the V2C-sign memory. The magnitudes go to the comparator, which finds the two smallest values and the index of the smallest one; a barrel shifter then reorders them for the next computation before they are stored in the Min register. To exchange messages between the decoder and the demodulator, the Le subtractor unit subtracts the channel LLR from the APP value to obtain the extrinsic LLR values, which are stored in the Le memory. The demodulator uses them in the next cycle to calculate its extrinsic messages, that is, the channel LLRs for the next iteration. Two interface issues limit the receiver throughput. The first is inside the LDPC decoder and causes idle cycles in the APP calculation. The second is the memory interface between the demodulator and the decoder.
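The compressed storage described in the notes, two minima plus one index in the Min register and one sign bit per edge in the V2C-sign memory, can be sketched as follows. This is an illustrative model rather than the authors' datapath: the normalisation factor `ALPHA`, the class `CheckState`, and all function names are hypothetical, and `compress` rebuilds the state from a full list of V2C messages, whereas the hardware updates it incrementally through the comparator.

```python
from dataclasses import dataclass

ALPHA = 0.75  # assumed normalisation factor for normalised min-sum (NMS)


@dataclass
class CheckState:
    min1: float   # smallest V2C magnitude seen by this check node
    min2: float   # second-smallest V2C magnitude
    idx1: int     # edge that produced min1
    signs: list   # V2C sign of every edge, +1 or -1


def compress(v2c):
    """Build the compact check-node state from the V2C messages (two or more)."""
    mags = [abs(m) for m in v2c]
    idx1 = min(range(len(mags)), key=mags.__getitem__)
    min2 = min(m for k, m in enumerate(mags) if k != idx1)
    signs = [1 if m >= 0 else -1 for m in v2c]
    return CheckState(mags[idx1], min2, idx1, signs)


def recover_c2v(state, edge):
    """Recover the C2V message on one edge from the compact state."""
    # The edge that supplied min1 must receive min2, every other edge min1.
    mag = state.min2 if edge == state.idx1 else state.min1
    sign = 1
    for k, s in enumerate(state.signs):
        if k != edge:
            sign *= s
    return ALPHA * sign * mag


def app_and_extrinsic(lc, c2v_list):
    """APP adder followed by the Le subtractor for one variable node."""
    app = lc + sum(c2v_list)
    return app, app - lc


state = compress([2.0, -0.5, 3.0, -1.5])
print(recover_c2v(state, 0), recover_c2v(state, 1))
print(app_and_extrinsic(1.0, [0.375, -1.125]))
```

Storing only `min1`, `min2`, `idx1`, and the signs is what allows the C2V messages to be regenerated on demand instead of being kept in a full message memory.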

Challenge - decoder

The degree of the jth block column is larger than that of the (j+1)th block column
  Idle issue
  Decrease throughput
Solution: Arrange the block columns in order of increasing degree

In the shuffle-based scheme, the column-degree distribution of the parity-check matrix affects the decoding throughput, because the V2C computation cannot start until the APP values are ready. Consider the case in which the degree of the jth block column is larger than that of the (j+1)th block column. When the C2V recover unit and the APP adder are ready to output data to the V2C calculator for the (j+1)th block column, the V2C calculator is still busy processing the data for the jth block column, so the C2V recover unit and the APP adder must idle for a clock cycle. If large and small block-column degrees alternate frequently in the decoding schedule, the idle time grows considerably and the decoding throughput decreases. To avoid this problem, we swap the block columns of the parity-check matrix so that the block-column degrees are in increasing order. The figure shows an example of the improved schedule: it needs only 17 clock cycles, compared with 18 before the rearrangement.
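The effect of the column order can be illustrated with a toy timing model. The model is an assumption made for this sketch, not the timing of the actual decoder: two pipeline stages with no buffer between them, each spending `degrees[j]` cycles on block column j, where the first stage (C2V recover and APP adder) cannot hand over a column until the second stage (V2C calculator) is free. The cycle counts it produces are those of the toy model only.

```python
def schedule_cycles(degrees):
    """Return (total_cycles, stall_cycles) for a two-stage pipeline without a buffer.

    Stage A (C2V recover + APP adder) and stage B (V2C calculator) each need
    degrees[j] cycles for block column j. Stage A may hand over column j, and
    start column j + 1, only once stage B has finished column j - 1.
    """
    a_start = 0   # cycle at which stage A starts the current column
    b_free = 0    # cycle at which stage B becomes free
    stalls = 0    # cycles stage A spends waiting for stage B
    for d in degrees:
        a_done = a_start + d
        handover = max(a_done, b_free)
        stalls += handover - a_done
        b_free = handover + d
        a_start = handover
    return b_free, stalls


mixed = [3, 1, 3, 1]            # hypothetical block-column degrees
print(schedule_cycles(mixed))           # degrees alternate between large and small
print(schedule_cycles(sorted(mixed)))   # increasing degree
```

In this model, stage A never stalls when the degrees are non-decreasing, because the next column always takes at least as long in stage A as the current column takes in stage B.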

Challenge – memory interface
Hardware idle: random interleaver
  Decrease throughput
Solution: Propose a hardware-friendly structure interleaver

After resolving the decoder idle issue, we now focus on the decoder-demodulator interface. If a random interleaver is used, the bits corresponding to a single 8-PAM symbol are likely to belong to non-consecutive block columns. The channel LLR values that the demodulator has just computed are then not necessarily those for the positions the decoder currently needs, and the extrinsic messages that the decoder finishes first do not necessarily make up all the symbols the demodulator needs at that moment. The demodulator and the decoder must wait a long time for the information they need, which results in a low decoding throughput and makes the interface between the decoder and the demodulator difficult to design. To optimize this interface, we propose a hardware-friendly structure interleaver, which places the bit positions of each circulant-size group of symbols in three adjacent block columns. With the proposed method, the decoder only needs to compute and send the Le values for three consecutive block columns to the demodulator, rather than for all block columns. As the figure shows, the shuffle-based IDD scheme can then realize one-codeword processing with minimal hardware idle time.
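One way to obtain the property described above is sketched below. The talk does not give the exact permutation, so this mapping is a hypothetical example: it assumes G block columns of circulant size Z with G divisible by 3, and places the three bits of a symbol at the same offset in block columns 3g, 3g + 1, and 3g + 2. The function names are illustrative.

```python
def symbol_bit_positions(symbol, z):
    """Codeword positions of the three bits of one symbol (hypothetical mapping)."""
    group, offset = divmod(symbol, z)
    # Bit k of the symbol sits at the same offset in block column 3 * group + k.
    return [(3 * group + k) * z + offset for k in range(3)]


def block_column(position, z):
    """Index of the block column that contains a codeword position."""
    return position // z


G, Z = 6, 4   # hypothetical number of block columns and circulant size
num_symbols = G * Z // 3

for s in (0, 5):
    positions = symbol_bit_positions(s, Z)
    print(s, positions, [block_column(p, Z) for p in positions])

# Every codeword position is used exactly once, so the mapping is a permutation.
used = sorted(p for s in range(num_symbols) for p in symbol_bit_positions(s, Z))
print(used == list(range(G * Z)))
```

Because all Z symbols of a group draw their bits from the same three block columns, the demodulator can process the whole group as soon as the decoder has finished those three columns.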

Proposed Optimized Memory Interface
Updating the demodulator LLRs $L_{c,j\sim j+2}$ only requires the extrinsic messages $L_{e,j\sim j+2}$.
Figure: conventional interface and proposed interface with a Z×3 FIFO buffer
Callout: saves 50% of the memory requirement

The demodulator can update its LLR values as soon as the decoder provides its extrinsic LLR values for three consecutive block columns. The Le values for block columns 0 to j-1 and for block columns j+3 to G-1 do not need to be buffered. The memory bank that originally stored the extrinsic messages can therefore be replaced with a buffer of three circulant sizes, which significantly reduces the memory needed to store the channel LLR values and the extrinsic messages.
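The buffering idea can be sketched with a three-entry window that slides over the block columns as the decoder produces them. This is a behavioural illustration only: it assumes that symbols are aligned to groups of block columns 3g, 3g + 1, 3g + 2, uses zero-filled placeholders instead of real Le values, and the values of G and Z are hypothetical.

```python
from collections import deque


def run_interface(num_block_columns, z):
    """Yield (first_column, le_window) each time the demodulator can start."""
    window = deque(maxlen=3)   # Z x 3 buffer: Le values of three block columns
    for j in range(num_block_columns):
        le_column = [0.0] * z  # placeholder for the decoder's Le output of column j
        window.append((j, le_column))  # the oldest column is dropped automatically
        # Fire once the window holds one complete, aligned group of three columns.
        if len(window) == 3 and window[0][0] % 3 == 0:
            yield window[0][0], [col for _, col in window]


def le_storage_entries(g, z):
    """Le entries held by a full memory bank versus the three-column buffer."""
    return g * z, 3 * z


print([first for first, _ in run_interface(6, 4)])
print(le_storage_entries(6, 4))
```

The buffer size depends only on Z, not on the number of block columns G, so the saving in Le storage grows with the code length.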

BER Results

Figure callouts: 0.08 dB, 0.05 dB

This figure shows the BER performance, where a 2 KByte code is used. The shuffle-based IDD receiver (diamond markers) has about the same error-rate performance as the layer-based IDD receiver (asterisk markers). With the proposed hardware-friendly structure interleaver, the error-rate performance improves noticeably, by almost 0.1 dB, and becomes better than that of the conventional non-IDD system. Compared with the layer-based IDD scheme, the gain is about 0.05 dB.

Hardware Complexity

This slide shows the improvements in the Lc and Le memory usage at the interface between the demodulator and the decoder, with the two-codeword layer-based IDD receiver (leftmost in the figure) as the baseline. The one-codeword shuffle-based IDD receiver (middle) gives a 19.7% reduction in gate count. Going from two codewords to one might seem to halve the storage, but only the memory depth changes, so the reduction is modest. The optimized architecture (rightmost) replaces the memory that stores the extrinsic messages with a buffer of three circulant sizes and achieves a 53.5% reduction in gate count relative to the baseline.

Comparison Results

Columns: Gray-based non-IDD, Layer-based IDD [7], Shuffle-based IDD. Rows with three values follow this column order; the other rows are given as they appear on the slide.
  Code: 8PSK + (18432, 16704)
  Technology: 90 nm
  Algorithm: NMS
  Max. iteration number: 15
  Quantization (bits): 5, 6
  Clock frequency (MHz): 166, 190
  Throughput (Mbps): 679.9, 1100, 1555
  Gate count (K): 1297, 1891, 1888
  Area (mm²): 3.66, 5.33, 5.32
  Hardware efficiency (Mbps/mm²): 185.76, 206.37, 292.19
Callout: 41.2% improvement

This table compares the hardware implementation results. It shows that this work achieves a better hardware efficiency than both the non-IDD receiver and the layer-based IDD receiver.

Conclusion

Propose an efficient shuffle-based IDD receiver for TLC applications
Hardware-friendly structure interleaver
  Labelling bits corresponding to a single 8-PAM symbol are distributed in three consecutive block columns
  Improve decoding throughput
  Decrease design complexity of memory interface
Optimized memory interface
  Using a small buffer
  Reduce area cost

In this talk, we have presented a hardware-efficient shuffle-based IDD receiver. The layer-based IDD receiver relies on a two-codeword schedule to improve hardware utilization. By replacing the layer-based decoder with a shuffle-based LDPC decoder, the proposed receiver does not need to double the memory size in order to store the information for two codewords. Second, we have presented a hardware-friendly structure interleaver, which makes the decoding process smoother, enhances the decoding throughput, and reduces the complexity of the hardware design. Finally, we have optimized the memory interface between the decoder and the demodulator: replacing the memory bank with a small buffer significantly reduces the memory requirement and the hardware area. Compared with the layer-based IDD receiver, the hardware efficiency is about 40% higher. Based on these results, we believe that the shuffle-based IDD receiver has great potential for use in next-generation storage and communication systems.
