Data management

Jeremy Edberg, the first paid employee at reddit, teaches us a lot about how to create a successful social site in a really good talk he gave at the RAMP conference. Watch it here: Scaling Reddit from 1 Million to 1 Billion–Pitfalls and Lessons.

Jeremy uses a virtue and sin approach. He shares examples of the mistakes made in scaling reddit, and it turns out they did a lot of good stuff too. Somewhat of a shocker: Jeremy is now a Reliability Architect at Netflix, so we get a little Netflix perspective thrown in for free.

Some of the lessons that stood out most for me: 

With both the F1 and Spanner papers out, it's now possible to understand their interplay a bit more holistically. So let's start by revisiting the key goals of both systems.

Abstract
This paper presents Polybase, a feature of SQL Server PDW V2 that allows users to manage and query data stored in a Hadoop cluster using the standard SQL query language. Unlike other database systems that provide only a relational view over HDFS-resident data through the use of an external table mechanism, Polybase employs a split query processing paradigm in which SQL operators on HDFS-resident data are translated into MapReduce jobs by the PDW query optimizer and then executed on the Hadoop cluster. The paper describes the design and implementation of Polybase along with a thorough performance evaluation that explores the benefits of employing a split query processing paradigm for executing queries that involve both structured data in a relational DBMS and unstructured data in Hadoop. Our results demonstrate that while the use of a split-based query execution paradigm can improve the performance of some queries by as much as 10X, one must employ a cost-based query optimizer that considers a broad set of factors when deciding whether or not it is advantageous to push a SQL operator to Hadoop. These factors include the selectivity factor of the predicate, the relative sizes of the two clusters, and whether or not their nodes are co-located. In addition, differences in the semantics of the Java and SQL languages must be carefully considered in order to avoid altering the expected results of a query.

Link to the paper
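To make the cost-based push-down decision concrete, here is a minimal sketch in Python of the kind of trade-off such an optimizer has to weigh. It is purely illustrative, not Polybase's actual cost model; every constant, function name, and parameter below is a hypothetical stand-in for the factors the abstract mentions (predicate selectivity, relative cluster sizes, co-location).

```python
# Illustrative sketch only -- not Polybase code. A toy cost model that
# captures the paper's point: pushing a SQL predicate into Hadoop as a
# MapReduce job only pays off under certain conditions. All constants
# and names here are hypothetical.

def push_predicate_to_hadoop(selectivity, hadoop_nodes, pdw_nodes,
                             colocated, hdfs_bytes,
                             network_bytes_per_sec=1e9,
                             mr_startup_cost_sec=30.0,
                             hadoop_scan_bytes_per_sec=5e8):
    """Return True if filtering in Hadoop is estimated to be cheaper
    than importing the raw HDFS data and filtering in the warehouse."""
    # Option 1: import everything, filter on the relational side.
    transfer_all = hdfs_bytes / (network_bytes_per_sec * pdw_nodes)

    # Option 2: run a MapReduce filter job, then import only the
    # qualifying rows. Non-co-located clusters pay an extra penalty.
    scan = hdfs_bytes / (hadoop_scan_bytes_per_sec * hadoop_nodes)
    transfer_filtered = (hdfs_bytes * selectivity) / (network_bytes_per_sec * pdw_nodes)
    if not colocated:
        transfer_filtered *= 2.0  # hypothetical cross-cluster traffic penalty
    split_cost = mr_startup_cost_sec + scan + transfer_filtered

    return split_cost < transfer_all

# A highly selective predicate over a big file favors push-down:
print(push_predicate_to_hadoop(selectivity=0.01, hadoop_nodes=48,
                               pdw_nodes=16, colocated=True,
                               hdfs_bytes=10e12))
```

The point is simply that pushing work to Hadoop is only a win when the MapReduce startup and scan costs are repaid by shipping far fewer bytes back to the warehouse.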

Guest post by Thierry Schellenbach, Founder/CTO of Fashiolista.com, follow @tschellenbach on Twitter and Github

Fashiolista started out as a hobby project which we built on the side. We had absolutely no idea it would grow into one of the largest online fashion communities. The entire first version took about two weeks to develop and our feed implementation was dead simple. We’ve come a long way since then and I’d like to share our experience with scaling feed systems.

Feeds are a core component of many large startups such as Pinterest, Instagram, Wanelo and Fashiolista. At Fashiolista the feed system powers the flat feed, aggregated feed and the notification system. This article will explain the troubles we ran into when scaling our feeds and the design decisions involved with building your own solution. Understanding the basics of how these feed systems work is essential as more and more applications rely on them.

Furthermore, we’ve open sourced Feedly, the Python module powering our feeds. Where applicable I’ll reference how to use it to quickly build your own feed solution.
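To make the fan-out idea concrete, here is a minimal fan-out-on-write sketch using Redis lists. This is not Feedly's actual API, just the basic pattern a flat feed is built on; the key names and helpers below are hypothetical.

```python
# A minimal fan-out-on-write sketch, NOT Feedly's API -- just the idea:
# when a user posts an activity, push its payload onto a capped Redis
# list for every follower, so reading a feed is a single LRANGE.
# Assumes a local Redis server and the redis-py package.
import json
import redis

r = redis.Redis()
FEED_MAX = 1000  # keep only the most recent activities per feed

def follow(user_id, follower_id):
    r.sadd(f"followers:{user_id}", follower_id)

def follower_ids(user_id):
    return r.smembers(f"followers:{user_id}")

def add_activity(user_id, activity):
    payload = json.dumps(activity)
    for follower in follower_ids(user_id):
        key = f"feed:{int(follower)}"
        r.lpush(key, payload)          # newest first
        r.ltrim(key, 0, FEED_MAX - 1)  # cap the feed length

def read_feed(user_id, limit=25):
    return [json.loads(x) for x in r.lrange(f"feed:{user_id}", 0, limit - 1)]
```

Fan-out-on-write trades write amplification (one push per follower) for cheap reads, and that trade-off is exactly what makes scaling feeds interesting in the first place.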

Original author: Todd Hoff

Erasure codes are one of those seemingly magical mathematical creations that, with the developments described in the paper XORing Elephants: Novel Erasure Codes for Big Data, are set to replace triple replication as the data storage protection mechanism of choice.

The result, says Robin Harris (StorageMojo) in an excellent article, Facebook’s advanced erasure codes: "WebCos will be able to store massive amounts of data more efficiently than ever before. Bad news: so will anyone else."

Robin says that with cheap disks triple replication made sense and was economical. With ever bigger BigData the overhead has become costly. But erasure codes have always suffered from unacceptably long repair times. This paper describes new Locally Repairable Codes (LRCs) that are efficiently repairable in terms of both disk I/O and bandwidth:

These systems are now designed to survive the loss of up to four storage elements – disks, servers, nodes or even entire data centers – without losing any data. What is even more remarkable is that, as this paper demonstrates, these codes achieve this reliability with a capacity overhead of only 60%.

They examined a large Facebook analytics Hadoop cluster of 3000 nodes with about 45 PB of raw capacity. On average about 22 nodes a day fail, but some days failures could spike to more than 100.

The LRC tests produced several key results (a toy sketch of the local repair idea follows the list):

  • Disk I/O and network traffic were reduced by half compared to RS codes.
  • The LRC required 14% more storage than RS, which is information-theoretically optimal for the obtained locality.
  • Repair times were much lower thanks to the local repair codes.
  • Much greater reliability thanks to fast repairs.
  • Reduced network traffic makes them suitable for geographic distribution.
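As a toy illustration of why locality matters, the sketch below (plain Python, not the paper's actual LRC construction) splits ten data blocks into two local groups and adds a single XOR parity per group. Repairing one lost block then only needs reads from within its group rather than across the whole stripe, which is where the savings in repair disk I/O and network traffic come from.

```python
# Toy illustration of local repair, not the paper's LRC construction:
# two local groups of five data blocks, one XOR parity per group.
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [bytes([i]) * 4 for i in range(10)]      # 10 tiny data blocks
groups = [data[:5], data[5:]]                   # two local groups
local_parity = [xor_blocks(g) for g in groups]  # one parity per group

# Simulate losing block 3 and repairing it from its local group only:
lost = 3
survivors = [b for i, b in enumerate(groups[0]) if i != lost] + [local_parity[0]]
repaired = xor_blocks(survivors)
assert repaired == data[lost]
print("repaired block 3 from", len(survivors), "local blocks")
```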

I wonder if we'll see a change in NoSQL database systems as well? 

Original author: Lee Aylward


One of Ars Technica's many memcached server graphs. Look at all those misses!

This week, memcached, a piece of software that prevents much of the Internet from melting down, turns 10 years old. Despite its age, memcached is still the go-to solution for many programmers and sysadmins managing heavy workloads. Without memcached, Ars Technica would likely be unable to serve this article to you at all.

Brad Fitzpatrick wrote memcached for LiveJournal way back in 2003 (check out the initial CVS commit here). While waiting for new hardware to help save the site from being overloaded, Fitzpatrick realized that he had plenty of unused RAM spread across LiveJournal's existing servers. He wrote memcached to take advantage of this spare memory and lighten the load on the site.

memcached is a distributed in-memory key-value store that uses a very simple protocol for storing and retrieving arbitrary data from memory instead of from a filesystem. To store a value, a program connects to the memcached server on the default port of 11211 and issues a series of basic commands. (Note: a binary protocol is also supported.)
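For example, here is roughly what that text protocol looks like over a raw socket in Python, assuming a memcached server is listening locally on 11211 (in practice you would use a client library rather than talking to the socket yourself):

```python
# A minimal demonstration of the memcached text protocol over a raw
# socket. Assumes a memcached server is running on localhost:11211.
import socket

with socket.create_connection(("127.0.0.1", 11211)) as s:
    value = b"hello from 2003"
    # set <key> <flags> <exptime> <bytes>\r\n<data>\r\n  ->  STORED
    s.sendall(b"set greeting 0 60 %d\r\n%s\r\n" % (len(value), value))
    print(s.recv(1024))   # b'STORED\r\n'

    # get <key>\r\n  ->  VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n
    s.sendall(b"get greeting\r\n")
    print(s.recv(1024))   # b'VALUE greeting 0 15\r\nhello from 2003\r\nEND\r\n'
```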

