Erasure codes are one of those seemingly magical mathematical creations that, with the developments described in the paper XORing Elephants: Novel Erasure Codes for Big Data, are set to replace triple replication as the data-storage protection mechanism of choice.
The result, says Robin Harris (StorageMojo) in his excellent article Facebook's advanced erasure codes: "WebCos will be able to store massive amounts of data more efficiently than ever before. Bad news: so will anyone else."
Robin says that with cheap disks, triple replication made sense and was economical. With ever-bigger BigData, the overhead has become costly. But erasure codes have always suffered from unacceptably long repair times. This paper describes new Locally Repairable Codes (LRCs) that are efficient to repair in terms of both disk I/O and bandwidth:
These systems are now designed to survive the loss of up to four storage elements – disks, servers, nodes or even entire data centers – without losing any data. What is even more remarkable is that, as this paper demonstrates, these codes achieve this reliability with a capacity overhead of only 60%: for every 10 data blocks, the paper's scheme stores 4 Reed-Solomon parity blocks plus 2 local parity blocks, 16 blocks in total, versus 30 blocks under triple replication.
They examined a large Facebook analytics Hadoop cluster of 3000 nodes with about 45 PB of raw capacity. On average about 22 nodes a day fail, but some days failures could spike to more than 100.
The LRC tests produced several key results (a toy sketch of the local-repair idea follows the list).
- Disk I/O and network traffic were reduced by half compared to RS codes.
- The LRC required 14% more storage than RS, which is information-theoretically optimal for the locality obtained.
- Repair times were much lower thanks to the locally repairable codes.
- Much greater reliability thanks to fast repairs.
- Reduced network traffic makes them suitable for geographic distribution.
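To make the locality idea concrete, here is a toy Python sketch (my own illustration, not code from the paper) of XOR-based local parity groups. A stripe of 10 data blocks is split into two groups of 5, each with its own XOR parity; repairing one lost block then reads only its 4 surviving group neighbours plus the group parity, instead of the 10 blocks a classic Reed-Solomon repair would fetch:

```python
# Toy illustration of local repair groups -- not the paper's construction.
import os
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

BLOCK = 16                                       # toy block size in bytes
data = [os.urandom(BLOCK) for _ in range(10)]    # one stripe: 10 data blocks

# Two local groups of 5 blocks, each with its own XOR parity block.
groups = [data[:5], data[5:]]
local_parity = [xor_blocks(g) for g in groups]

# Lose block 3. Local repair reads the 4 surviving group members plus the
# group parity (5 reads); a classic RS repair would read 10 blocks.
lost = 3
survivors = [b for i, b in enumerate(groups[0]) if i != lost] + [local_parity[0]]
assert xor_blocks(survivors) == data[lost]
print(f"block {lost} repaired from {len(survivors)} reads instead of 10")
```

The halved read count in this toy mirrors the halved disk I/O and network traffic reported above.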
I wonder if we'll see a change in NoSQL database systems as well?
Related Articles
- Erasure Coding vs. Replication: A Quantitative Comparison
- Ceph - a distributed object store.
Stack Exchange
This Q&A is part of a weekly series of posts highlighting common questions encountered by technophiles and answered by users at Stack Exchange, a free, community-powered network of 100+ Q&A sites.
Dokkat appears to think that databases are overused. "Instead of a database, I just serialize my data to JSON, saving and loading it to disk when necessary," he writes. "All the data management is made on the program itself, which is faster AND easier than using SQL queries." What is missing here? Why should a developer use a database when saving data to a disk might work just as well?
See the original question here.
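For context, here is a minimal sketch of the approach Dokkat describes (the data.json path and dict-shaped dataset are hypothetical). Even with an atomic write bolted on, this rewrites the whole dataset on every change and offers none of the concurrency control, indexing, or partial-update guarantees a database provides:

```python
# A minimal "just serialize to JSON" persistence layer, hardened slightly.
import json
import os
import tempfile

DB_FILE = "data.json"   # hypothetical path

def load():
    try:
        with open(DB_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save(data):
    # Write to a temp file, then rename, so a crash can't leave a
    # half-written file -- one of many guarantees an RDBMS handles for you.
    fd, tmp = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f)
    os.replace(tmp, DB_FILE)

data = load()
data["user:1"] = {"name": "Ada"}
save(data)   # rewrites the entire dataset on every change
```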
Hi, I'm just starting to build web applications and I'm not really familiar with how to store data to make sure something like this scales well.
Consider this pseudo-data, in JSON: { unique id, [array of child ids], [auxiliary data] }
Right now, every request I get requires me to get auxiliary info given an ID. About 3/4 of the time I need to update some of the auxiliary data. About half of the time, I need to query again for the auxiliary data of one of the "child" id's from the array of ids. This makes me want to just use a relational database and generate another query based on the child IDs if I need to traverse the natural graph structure of the data (do very "shallow" searches).
I'm wondering how well this would continue to work if I suddenly decided to do a lot more "depth-first" query patterns (that is, every query would likely be followed by a query to its child, which has an unpredictable ID), and whether specialized graph databases (not SQL) would give me more scalability in this case. I don't actually know much about how they work, but I imagine if there's any reason they exist, it's for stuff like this.
Can anyone point me in the right direction? If a single request generates a chain of sequential SELECTs to traverse a graph am I doing it wrong?
submitted by ReallyGoodAdvice
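For what it's worth, the chain of sequential SELECTs the questioner describes can often be collapsed into a single round trip with a recursive CTE, supported by SQLite, PostgreSQL, and others. A minimal sketch (the nodes/edges schema is hypothetical, mirroring the pseudo-data above):

```python
# Traversing a graph stored in a relational database with one query
# instead of one SELECT per level.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, aux TEXT);
    CREATE TABLE edges (parent INTEGER, child INTEGER);
    INSERT INTO nodes VALUES (1,'root'), (2,'a'), (3,'b'), (4,'leaf');
    INSERT INTO edges VALUES (1,2), (1,3), (2,4);
""")

# Fetch node 1 and all of its descendants in a single round trip.
rows = conn.execute("""
    WITH RECURSIVE tree(id) AS (
        SELECT 1
        UNION ALL
        SELECT e.child FROM edges e JOIN tree t ON e.parent = t.id
    )
    SELECT n.id, n.aux FROM nodes n JOIN tree USING (id)
""").fetchall()
print(rows)   # e.g. [(1, 'root'), (2, 'a'), (3, 'b'), (4, 'leaf')]
```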
New submitter rescrv writes "Key-value stores (like Cassandra, Redis and DynamoDB) have been replacing traditional databases in many demanding web applications (e.g. Twitter, Google, Facebook, LinkedIn, and others). But for the most part, the differences between existing NoSQL systems come down to the choice of well-studied implementation techniques; in particular, they all provide a similar API that achieves high performance and scalability by limiting applications to simple operations like GET and PUT.
HyperDex, a new key-value store developed at Cornell, stands out in the NoSQL spectrum with its unique design. HyperDex employs a unique multi-dimensional hash function to enable efficient search operations — that is, objects may be retrieved without using the key (PDF) under which they are stored. Other systems employ indexing techniques to enable search, or enumerate all objects in the system. In contrast, HyperDex's design enables applications to retrieve search results directly from servers in the system. The results are impressive. Preliminary benchmark results on the project website show that HyperDex provides significant performance improvements over Cassandra and MongoDB. With its unique design, and impressive performance, it seems fitting to ask: Is HyperDex the start of NoSQL 2.0?"
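A toy sketch of the multi-dimensional hashing idea (my own illustration, not HyperDex's actual scheme): each attribute hashes to one coordinate, the coordinate tuple selects a server, and a search that fixes any one attribute only needs to contact the servers along one slice of the grid:

```python
# Toy hyperspace-hashing illustration; grid size and attributes are made up.
import hashlib

GRID = 4   # 4 buckets per dimension -> a 4 x 4 grid of 16 servers

def coord(value):
    """Hash one attribute value to a coordinate in [0, GRID)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % GRID

def server_for(obj):
    # An object's location is determined jointly by all of its attributes.
    return (coord(obj["first"]), coord(obj["last"]))

def servers_for_search(last):
    # Fixing one attribute pins one dimension: only GRID servers (one row
    # of the grid) need to answer, not all GRID * GRID of them.
    return [(x, coord(last)) for x in range(GRID)]

obj = {"first": "John", "last": "Smith"}
print("stored on server", server_for(obj))
print("searching last='Smith' contacts", servers_for_search("Smith"))
```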
Author: Khayundi, Peter
Issue Date: 2009
Publisher: University of Fort Hare, 2009
Abstract: Object oriented databases have been gaining popularity over the years. Their ease of use and the advantages that they offer over relational databases have made them a popular choice amongst database administrators. Their use in previous years was restricted to business and administrative applications, but improvements in technology and the emergence of new, data-intensive applications have led to the increase in the use of object databases. This study investigates four Open Source object-oriented databases on their ability to carry out the standard database operations of storing, querying, updating and deleting database objects. Each of these databases will be timed in order to measure which is capable of performing a particular function faster than the other.
Description: Thesis (MSc) (Computer Science), University of Fort Hare, 2009
URI: Link
Appears in Collections: Theses and Dissertations (Computer Science)
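As a rough illustration of the methodology the abstract describes, a minimal timing harness might look like this (the store/query/update/delete calls are hypothetical placeholders for each database's binding):

```python
# Minimal sketch of a CRUD benchmark harness for comparing databases.
import time

def time_operation(label, fn, repeats=1000):
    """Run fn `repeats` times and report the average latency per call."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed / repeats * 1e6:.1f} us/op")

# Usage, given some hypothetical object-database binding `db`:
# time_operation("store",  lambda: db.store(make_object()))
# time_operation("query",  lambda: db.query(some_id))
# time_operation("update", lambda: db.update(some_id, new_value))
# time_operation("delete", lambda: db.delete(some_id))
```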