Erasure Coding Demystified

Data center operators are looking for new ways to ensure data is never lost while also keeping costs down. In this video, we define and explain what erasure coding is and why it is the solution for fault-tolerant, high-availability storage.

To learn more, visit: https://www.fungible.com/

Follow us on Twitter: https://twitter.com/fungible_inc
Follow us on LinkedIn: https://www.linkedin.com/company/fungible-inc./

#datacenter #technology #cloud #network #data #datacentric #datacenterarchitecture #datacentricsw #softwarearchitecture #fungible_inc


Data is inherently valuable in today's digitally transformed and connected world. Like your family photos, works of art, or your medical history, some data is precious and irreplaceable, and some data needs to be always available. Mission-critical data is needed in real time, all the time, for applications such as banking, stock trading, and autonomous driving. These requirements drive the need for fault-tolerant, high-durability, high-availability storage.

Today, a common metric used to represent durability and availability is the number of nines. As an example, for durability, five nines corresponds to losing about 100 files out of ten million per year, while nine nines corresponds to losing just one file every 100 years. For availability, five nines translates to about five minutes of downtime per year, and nine nines translates to just 30 milliseconds. To achieve a high number of nines, storage systems must have a high mean time between failures (MTBF), and when a failure does occur, they must recover quickly, that is, have a low mean time to recover (MTTR). Together, these lead to a high mean time to data loss (MTTDL).

To understand mean time to data loss, we must first understand failure domains. For example, the storage client, the network, and the storage target, and within the storage target the storage controller and the disks, are all different failure domains. Local faults and failures are inevitable, but to ensure that they do not impact overall durability and availability from the user's perspective, some type of redundancy mechanism is typically put in place.

Zooming in on the storage target, one of the simplest strategies to protect data and maintain high availability is to replicate identical copies of the data, commonly known as mirroring. It is important that the redundant copies live in different failure domains so that errors are confined within a domain and do not impact the others. This technique requires very little processing or compute capability, but it is clearly an expensive solution in terms of capacity.

A different technique is the use of a parity code. In the most basic example, let's assume the parity code is the XOR of D1 and D2. By storing the result of the XOR, we can reconstruct either D1 or D2 from the surviving chunk and the parity. A further generalization of this concept divides the data into four chunks and uses two different linear equations to form two unique parity codes. Using this scheme, if D3 and D4 go down, the data can be reconstructed by solving the linear equations represented by the parity codes. Fundamentally, all you need is basic linear algebra; for practical implementations extrapolated to large amounts of data, a branch of algebra called Galois fields is used.

This scheme is essentially the foundation of erasure coding: data is cut into N chunks, and the parities are the erasure codes. The number of parity blocks corresponds to the number of failures the system can tolerate. In this example, where N equals 4 and P equals 2, the system can tolerate up to two failures, similar to triple replication, but it requires a capacity overhead of only 50 percent versus 200 percent for triple replication.

Let's look at a quantitative example. It shows how both schemes can sustain two node failures and achieve similar durability metrics, and in a scheme where P equals 4, the system can sustain four failed nodes. Now compare the TCO benefits: with triple replication, your system can sustain two failed nodes; with erasure coding where N is 4 and P is 2, you can also sustain two failed nodes and achieve similarly high durability, but at half the storage capacity versus replication.
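To make the number-of-nines arithmetic concrete, here is a minimal sketch in Python that converts an availability level into expected downtime per year; it reproduces the roughly five minutes for five nines and roughly 30 milliseconds for nine nines quoted above.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds

def downtime_per_year_seconds(nines: int) -> float:
    """Expected downtime per year for an availability of `nines` nines (e.g. 5 -> 99.999%)."""
    unavailability = 10 ** (-nines)  # fraction of the year the system is down
    return SECONDS_PER_YEAR * unavailability

print(downtime_per_year_seconds(5))  # ~315 seconds, about five minutes per year
print(downtime_per_year_seconds(9))  # ~0.0315 seconds, about 30 milliseconds per year
```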
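The single-parity XOR scheme is just as easy to show. A minimal sketch, assuming the two data chunks are byte strings of equal length: storing P = D1 XOR D2 allows either chunk to be rebuilt from the surviving chunk and the parity.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))

d1 = b"hello world!"
d2 = b"erasure code"
p = xor_bytes(d1, d2)        # parity chunk, stored in a separate failure domain

# If d1 is lost, reconstruct it from the parity and the surviving chunk.
recovered = xor_bytes(p, d2)
assert recovered == d1
```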
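The 4-data, 2-parity generalization can be sketched with ordinary linear algebra, exactly as described: two independent linear equations produce two parity values, and any two lost chunks are recovered by solving the resulting 2x2 system. The coefficients below are illustrative; real implementations perform the same arithmetic symbol by symbol over a Galois field such as GF(2^8) (for example with Vandermonde or Cauchy coefficient matrices) so that every value stays a fixed-size byte.

```python
# Treat each chunk as a single integer for illustration; real systems apply the
# same equations to every byte position across large chunks.
d1, d2, d3, d4 = 13, 7, 42, 99

# Two independent linear equations define the two parity codes:
#   p1 = d1 +   d2 +   d3 +   d4
#   p2 = d1 + 2*d2 + 3*d3 + 4*d4
p1 = d1 + d2 + d3 + d4
p2 = d1 + 2 * d2 + 3 * d3 + 4 * d4

# Suppose d3 and d4 are lost. The surviving data plus the parities give a 2x2 system:
#   1*d3 + 1*d4 = p1 - d1 - d2      (= r1)
#   3*d3 + 4*d4 = p2 - d1 - 2*d2    (= r2)
r1 = p1 - d1 - d2
r2 = p2 - d1 - 2 * d2

# Solve by elimination (the determinant of [[1, 1], [3, 4]] is 1, so it is invertible).
d4_rec = r2 - 3 * r1
d3_rec = r1 - d4_rec

assert (d3_rec, d4_rec) == (d3, d4)
```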
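Finally, the capacity comparison is straightforward arithmetic: both schemes tolerate two lost nodes, but they store very different amounts of raw data for the same user data.

```python
def capacity_overhead(raw_multiple: float) -> str:
    """Extra raw capacity stored beyond the user data, as a percentage."""
    return f"{(raw_multiple - 1) * 100:.0f}%"

# Triple replication: three full copies, tolerates losing any two of them.
print("triple replication:", capacity_overhead(3.0))           # 200%

# Erasure coding with N=4 data chunks and P=2 parity chunks: tolerates any two lost chunks.
print("erasure coding 4+2:", capacity_overhead((4 + 2) / 4))   # 50%
```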
Category: Network Storage