Union Memory PCIe 5.0 SSD Supports the Full-Process Large Model Training with High Reliability through Hardware-Software Synergy 

Date: May 30, 2025

http://en.unionmem.com/news_detail.php?menuid=31&id=21

Full-process large model training has pushed data storage systems beyond their traditional boundaries. As a core component of AI computing infrastructure, enterprise-grade SSDs (ESSDs) play a critical role in keeping large model training efficient and stable through high reliability, high performance, and intelligent management capabilities.
From massive data preprocessing to frequent parameter iterations, and from fine-tuning to real-time inference, every stage of large model training requires storage devices to balance “hard metrics” with “soft competencies”. This article looks, from a reliability perspective, at how Union Memory’s new-generation PCIe Gen5 ESSD, the UH812a/UH832a, supports full-process large model training.


Reliability Hard Metrics: Gracefully Tackling the Data Deluge "Endurance Race"


During large model training, massive datasets ranging from 10TB to 100PB must be processed under high read/write frequencies and heavy loads. ESSDs meet this challenge with “hard metrics” such as high endurance, large capacity, and strong mixed read/write performance.
The UH812a/UH832a is equipped with the latest PCIe 5.0 interface. Its core metrics, including storage bandwidth, latency, density, endurance, data integrity, lifespan, and stability, surpass industry standards and rank among the best of its generation.


Ø High-Speed Bandwidth & Ultra-Low Latency 

PCIe 5.0 high-speed interface: It supports single-port and dual-port modes as well as the NVMe 2.0 protocol. Bandwidth is doubled compared with PCIe 4.0, which allows efficient cleaning, labeling, and format conversion of massive unstructured data (text, images, etc.).
High throughput: With sequential read speeds of up to 14,900 MB/s and sequential write speeds of up to 10,500 MB/s, its peak performance leads the industry among products of the same generation.
Ultra-low latency: The 4KB random read latency at QD1 is ≤55μs, a 43% improvement over the previous generation (UH811a series).
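The bandwidth doubling can be checked with a back-of-the-envelope calculation. The sketch below computes the theoretical one-direction link bandwidth for both generations; the x4 lane count is an assumption (the article does not state the link width), and the function name is illustrative only.

```python
# Theoretical one-direction PCIe link bandwidth, before protocol overhead.
# Assumption: a x4 link; the article does not state the lane count.

GT_PER_S = {"PCIe 4.0": 16.0, "PCIe 5.0": 32.0}  # raw transfer rate per lane
ENCODING = 128 / 130  # 128b/130b line encoding, used by both generations


def link_bandwidth_gb_s(generation: str, lanes: int = 4) -> float:
    """Return the usable link bandwidth in GB/s (decimal) for one direction."""
    raw_gbit_s = GT_PER_S[generation] * lanes * ENCODING
    return raw_gbit_s / 8  # bits to bytes


gen4 = link_bandwidth_gb_s("PCIe 4.0")
gen5 = link_bandwidth_gb_s("PCIe 5.0")
print(f"PCIe 4.0 x4: {gen4:.2f} GB/s")
print(f"PCIe 5.0 x4: {gen5:.2f} GB/s")
print(f"ratio: {gen5 / gen4:.1f}x")
```

Under the x4 assumption, the result is about 7.88 GB/s for PCIe 4.0 and about 15.75 GB/s for PCIe 5.0, so a 14,900 MB/s sequential read sits close to the ceiling of the newer link.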



Ø High density and large capacity
15.36TB per disk: A 15.36TB SSD typically offers a total-bytes-written (TBW) rating of 28PB to 70PB, meeting the storage demands of large model parameters while reducing the cost of data migration.

Ø Error rate and data integrity
UBER: 1E-18. According to the JESD218A standard (SSD reliability test method), an ESSD meets the requirement with an uncorrectable bit error rate (UBER) of ≤1E-17, while some high-end products achieve 1E-18 through technical optimizations.
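To put the UBER figures in perspective, the expected number of uncorrectable bit errors is roughly the UBER multiplied by the number of bits read. The sketch below applies this to a single full read of a 100PB dataset, the upper end of the range mentioned earlier. The function name is illustrative, and the estimate assumes independent errors.

```python
# Expected uncorrectable bit errors = UBER x bits read.
# Assumption: errors are independent and the dataset is read exactly once.

PB = 1e15  # bytes in a decimal petabyte


def expected_uncorrectable_errors(bytes_read: float, uber: float) -> float:
    """Return the expected count of uncorrectable bit errors."""
    return bytes_read * 8 * uber


for uber in (1e-17, 1e-18):
    errors = expected_uncorrectable_errors(100 * PB, uber)
    print(f"UBER {uber:.0e}: about {errors:.1f} expected errors per 100PB read")
```

At 1E-17 this works out to about 8 expected errors per 100PB read, and at 1E-18 to about 0.8, which is why one order of magnitude matters at this data scale.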

Ø High durability
DWPD: Up to 3 DWPD (UH832a). The drive sustains 3 drive writes per day (DWPD) throughout the 5-year warranty period, making it well suited to write-intensive workloads.
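DWPD and TBW are two views of the same endurance budget: TBW = capacity × DWPD × 365 × warranty years. The sketch below shows the conversion. The 1 DWPD rating and the 12.8TB capacity point used in the example are assumptions chosen for illustration and are not stated in the article; only the 15.36TB capacity, the 3 DWPD rating, and the 5-year warranty come from the text.

```python
# Convert a DWPD rating into total bytes written over the warranty period.

def tbw_pb(capacity_tb: float, dwpd: float, years: int = 5) -> float:
    """Return total bytes written in PB for a given capacity and DWPD rating."""
    return capacity_tb * dwpd * 365 * years / 1000  # TB to PB


# Assumed pairing 1: 15.36TB at 1 DWPD (the DWPD value is an assumption).
print(f"15.36TB @ 1 DWPD: {tbw_pb(15.36, 1):.1f} PB")
# Assumed pairing 2: 12.8TB at 3 DWPD (the capacity is an assumption).
print(f"12.8TB  @ 3 DWPD: {tbw_pb(12.8, 3):.1f} PB")
```

Under these assumptions, the two results land at roughly 28PB and 70PB, which matches the TBW range quoted above.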


Ø High confidence 
MTBF: ≥2.5 million hours, tested across more than 1,200 disks
AFR: ≤0.35%
According to the OCP standards, ESSDs must achieve an MTBF of ≥2 million hours (operating temperature: 0°C-55°C) and an AFR of ≤0.44%. The UH812a/UH832a exceeds both thresholds with high confidence, easily meeting the demands of model training scenarios.
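MTBF and AFR are linked: under a constant failure rate, AFR = 1 − exp(−8760 / MTBF), where 8760 is the number of hours in a year. The sketch below applies that common conversion, assuming an exponential failure model and 24×7 operation; it is not a description of how the vendor derived its figures.

```python
import math

HOURS_PER_YEAR = 8760  # assumes continuous 24x7 operation


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Return the annualized failure rate (as a fraction) for a given MTBF."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


for mtbf in (2.0e6, 2.5e6):
    print(f"MTBF {mtbf / 1e6:.1f}M hours -> AFR {afr_from_mtbf(mtbf) * 100:.2f}%")
```

An MTBF of 2 million hours corresponds to an AFR of about 0.44%, and 2.5 million hours to about 0.35%, consistent with the paired figures above.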

System-Level Reliability Soft Competencies: Millisecond-Level Response for the Inference “Agility Battle”

As the model fine-tuning and inference stages begin, the data size decreases, making parameter reading speed and model loading speed even more critical. Storage demands shift toward low latency and high quality of service (QoS). At this stage, ESSDs rely on their soft competencies to respond swiftly in the agility battle of inference.
The UH812a/UH832a is designed to meet the typical demands of AI inference scenarios, integrating system-level reliability designs such as algorithm optimization, fault tolerance and recovery mechanisms, intelligent monitoring and maintenance, and data protection. Drawing on years of comprehensive testing and validation experience, we have built a multidimensional “soft competency” assurance system.


Ø  Firmware algorithm optimization
Enhanced LDPC error correction algorithm: Its error correction capability exceeds what the Flash memory requires, accurately identifying and correcting errors that occur during data transmission and storage. The LDPC + DSP algorithm engine integrates hard-decision decoding, soft-decision decoding, and DSP, extending Flash lifespan by 5 times.
Intelligent wear leveling technology: It balances wear stress across Flash memory cells by differentiating between “robust” and “fragile” NAND cells and optimizing write distribution to avoid localized over-erasing. Combined with intelligent health monitoring, it provides early warnings of potential risks and extends SSD lifespan.
Intelligent FSP algorithm: Through hardware-software co-design optimized for NAND characteristics, it counters the performance degradation and declining data reliability that SSDs suffer during long-term use. The FSP algorithm achieves the lowest bit error rate in the industry, which ensures reliability at end of life and keeps performance fluctuation across the SSD’s lifecycle below 10%.
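The wear leveling idea described above can be illustrated with a toy allocator that always writes to the block with the lowest relative wear, so that “robust” blocks absorb more erase cycles than “fragile” ones. This is a minimal sketch, not the drive's firmware algorithm; the `Block` class, the `allocate` function, and the cycle ratings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    rated_cycles: int   # hypothetical per-block endurance estimate
    erase_count: int = 0

    @property
    def wear(self) -> float:
        """Fraction of this block's estimated endurance already consumed."""
        return self.erase_count / self.rated_cycles


def allocate(blocks: list[Block]) -> Block:
    """Pick the least-worn block relative to its own endurance and erase it."""
    target = min(blocks, key=lambda b: b.wear)
    target.erase_count += 1
    return target


# Two "robust" blocks (6000 cycles) and two "fragile" blocks (3000 cycles).
blocks = [Block(0, 6000), Block(1, 6000), Block(2, 3000), Block(3, 3000)]
for _ in range(900):
    allocate(blocks)

for b in blocks:
    print(f"block {b.block_id}: erases={b.erase_count}, wear={b.wear:.3f}")
```

In this toy run the robust blocks end up with about twice as many erases as the fragile ones while the relative wear stays even, which is the effect that avoids localized over-erasing.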


Ø Fault tolerance and recovery mechanism
Built-in RAID-like algorithm: Based on intelligent RAID-like algorithms, it recovers data when media errors occur, ensuring that a single chip failure does not affect data integrity.
Flexible RAID algorithm: When a Flash device fails, it actively recovers the data from the failed Flash and continues to provide RAID protection for that data.
Power-loss protection: When the server loses power unexpectedly, the built-in capacitor keeps the drive powered long enough to persist all data to non-volatile memory. Cached data is written first, preventing interrupted model training and lost model parameters.
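The RAID-like recovery described above can be sketched with single-parity XOR, the simplest scheme of this kind. The article does not say which parity scheme the drive uses, so XOR parity here is an assumption made for illustration, and the chunk contents and names are hypothetical.

```python
from functools import reduce


def xor_chunks(chunks: list[bytes]) -> bytes:
    """XOR equal-length chunks together byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


# One stripe spread across three Flash dies, plus a parity chunk.
data = [b"die0-dat", b"die1-dat", b"die2-dat"]
parity = xor_chunks(data)

# Simulate losing die 1, then rebuild its chunk from the survivors and parity.
lost = 1
survivors = [chunk for i, chunk in enumerate(data) if i != lost] + [parity]
rebuilt = xor_chunks(survivors)

print(rebuilt == data[lost])
```

Because XOR is its own inverse, any single missing chunk equals the XOR of all remaining chunks and the parity, which is why one failed chip does not cost any data in this scheme.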

Ø  Intelligent monitoring and predictive maintenance
Health state report: It monitors remaining lifespan, temperature, IO statistics, bad block rate, and other metrics in real time, and supports device diagnostics, monitoring, and SMART information reporting.
Data inspection technology: It periodically inspects for errors, handles bad blocks, and verifies data. A background process validates the data on the entire disk, effectively preventing data corruption. If data is at risk of corruption, the drive promptly relocates the affected data and retires the affected Flash space, preventing erroneous reads in business operations and ensuring data reliability, integrity, and device health.
NVMe-MI out-of-band management: It supports device management through out-of-band channels, including hardware/software state monitoring, host performance monitoring, SSD firmware upgrade and activation, and out-of-band business management.
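On the host side, health reporting of this kind is typically consumed by comparing SMART fields against thresholds. The sketch below checks a few fields that the NVMe SMART / Health Information log defines (percentage used, temperature, available spare, media errors). The dictionary keys, the threshold values, and the `assess_health` function are hypothetical and do not reflect any particular tool's output format.

```python
# Evaluate a SMART snapshot against simple thresholds.
# Key names and default thresholds are assumptions for illustration.

def assess_health(smart: dict, max_temp_c: float = 70.0,
                  max_used_pct: int = 90) -> list[str]:
    """Return a list of warning strings; an empty list means no findings."""
    warnings = []
    if smart["percentage_used"] >= max_used_pct:
        warnings.append("endurance nearly consumed")
    if smart["temperature_c"] >= max_temp_c:
        warnings.append("temperature too high")
    if smart["available_spare_pct"] <= smart["spare_threshold_pct"]:
        warnings.append("spare capacity below threshold")
    if smart["media_errors"] > 0:
        warnings.append("media errors reported")
    return warnings


sample = {
    "percentage_used": 93,
    "temperature_c": 41.0,
    "available_spare_pct": 100,
    "spare_threshold_pct": 10,
    "media_errors": 0,
}
print(assess_health(sample))
```

A monitoring agent could run such a check periodically and raise an alert before the drive reaches end of life, which is the predictive-maintenance use case described above.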

Ø  Full-chain data protection
End-to-end data protection: It protects data along the entire data path and supports user data protection via DIF (Data Integrity Field) fields. Data is validated with checks whenever it moves between modules within the disk, which significantly reduces the risk of data loss and extends SSD lifespan in complex scenarios such as large model inference.
Advanced Flash access technology: It combines the Read retry and Adaptive read technologies of the Flash memory to effectively ensure data validity.
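DIF-style protection can be illustrated with the widely used T10 layout, in which each logical block carries 8 bytes of protection information: a 2-byte CRC guard tag, a 2-byte application tag, and a 4-byte reference tag. The article does not specify the format used by the drive, so the layout, the helper names, and the tag values below are assumptions for illustration.

```python
import struct


def crc16_t10dif(data: bytes) -> int:
    """CRC-16 with polynomial 0x8BB7, initial value 0, no bit reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_pi(block: bytes, app_tag: int, ref_tag: int) -> bytes:
    """Build the 8-byte protection information for one logical block."""
    return struct.pack(">HHI", crc16_t10dif(block), app_tag, ref_tag)


def verify(block: bytes, pi: bytes, expected_ref_tag: int) -> bool:
    """Check the guard CRC and the reference tag (for example, the LBA)."""
    guard, _app_tag, ref_tag = struct.unpack(">HHI", pi)
    return guard == crc16_t10dif(block) and ref_tag == expected_ref_tag


block = bytes(range(256)) * 2           # one 512-byte logical block
pi = make_pi(block, app_tag=0, ref_tag=1234)

corrupted = bytes([block[0] ^ 0x01]) + block[1:]  # flip a single bit
print(verify(block, pi, 1234), verify(corrupted, pi, 1234))
```

The guard tag catches corruption of the payload, while the reference tag catches data that is intact but was written to or read from the wrong location, so the two checks cover different failure modes along the data path.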

Ø  Deep tuning and verification
Enterprise-level R&D lab: It carries out comprehensive testing and verification across software development, algorithms, chips, hardware, and software testing. Built on the three major industry standards (JEDEC, SNIA, and OCP), it has strong product validation and deep tuning capabilities. Multiple rounds of reliability verification ensure the long-term reliability and stability of customers’ SSDs.
Full-process reliability verification: Whitebox, greybox, and blackbox testing verify software features, functionality, and reliability, backed by more than 4,000 specialized reliability test cases. In addition, a compatibility-focused CI pipeline continuously strengthens reliability testing, maintaining industry-leading test coverage and stress levels.


In summary, achieving high ESSD reliability requires a combination of hardware and software. This means meeting hard metrics (such as MTBF, UBER, and AFR) and excelling in soft competencies (such as algorithm optimization, fault tolerance and recovery, and high-standard testing and verification). By building a “zero data loss” line of protection, we can support the full-process requirements of large models, from PB-level training data to millisecond-level inference response.
As a benchmark enterprise-grade PCIe 5.0 product, the UH812a/UH832a unleashes computational potential on a stable and reliable storage foundation, providing customers and partners with dependable data storage infrastructure.

