
Scalability


Ever wonder what powers Google's world-spirit-sensing Zeitgeist service? No, it's not a homunculus of Georg Wilhelm Friedrich Hegel sitting in each browser. It's actually a stream processing system (think streaming MapReduce on steroids) called MillWheel, described in this very well written paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. MillWheel isn't used just for Zeitgeist at Google; it also powers streaming joins for a variety of Ads customers, a generalized anomaly-detection service, and network switch and cluster health monitoring.

Abstract:

MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework’s fault-tolerance guarantees.

 

This paper describes MillWheel’s programming model as well as its implementation. The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel’s features are used. MillWheel’s programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. In practice, we find that MillWheel’s unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
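To make the programming model a little more concrete, here is a minimal sketch of what a keyed, windowed computation might look like. This is illustrative C++ with a made-up record type, window size, and driver, not MillWheel's actual API: the framework delivers records per key, the computation updates persistent per-key state, and a logical-time timer fires the aggregation.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <utility>

// Hypothetical keyed record with a logical (event-time) timestamp.
struct Record {
    std::string key;
    int64_t     value;
    int64_t     timestamp;  // logical time in milliseconds
};

// Sketch of a MillWheel-style computation: the framework calls ProcessRecord
// for each record and ProcessTimer when a logical-time timer fires; per-key
// state is assumed to be persisted by the framework between calls.
class WindowedSum {
public:
    void ProcessRecord(const Record& r) {
        int64_t window_end = (r.timestamp / kWindowMs + 1) * kWindowMs;
        state_[{r.key, window_end}] += r.value;   // update persistent per-key state
        timers_[{r.key, window_end}] = true;      // request a timer at window end
    }

    void ProcessTimer(const std::string& key, int64_t window_end) {
        std::cout << key << " window ending " << window_end
                  << " -> sum " << state_[{key, window_end}] << "\n";
        state_.erase({key, window_end});          // garbage-collect emitted state
    }

    // Stand-in for the framework's low-watermark machinery: fire every timer
    // whose window end is at or below the current logical time.
    void AdvanceLogicalTime(int64_t now) {
        for (auto it = timers_.begin(); it != timers_.end();) {
            if (it->first.second <= now) {
                ProcessTimer(it->first.first, it->first.second);
                it = timers_.erase(it);
            } else {
                ++it;
            }
        }
    }

private:
    static constexpr int64_t kWindowMs = 60'000;  // one-minute windows
    std::map<std::pair<std::string, int64_t>, int64_t> state_;
    std::map<std::pair<std::string, int64_t>, bool>    timers_;
};

int main() {
    WindowedSum c;
    c.ProcessRecord({"query:cats", 1, 5'000});
    c.ProcessRecord({"query:cats", 1, 42'000});
    c.ProcessRecord({"query:dogs", 1, 61'000});
    c.AdvanceLogicalTime(60'000);   // emits the first window's aggregate
    c.AdvanceLogicalTime(120'000);  // emits the second window's aggregate
}
```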


Awesome paper on how particular synchronization mechanisms scale on multi-core architectures: Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask.

The goal is to pick a locking approach that doesn't degrade as the number of cores increases. Like everything else in life, that doesn't appear to be generically possible:

None of the nine locking schemes we consider consistently outperforms any other one, on all target architectures or workloads. Strictly speaking, to seek optimality, a lock algorithm should thus be selected based on the hardware platform and the expected workload.
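To get a feel for the design space, here is a hedged sketch of two of the simpler schemes the paper covers, a test-and-set spinlock and a ticket lock (illustrative C++, not the paper's benchmark code). Which one wins depends on how much contention there is and on how the hardware moves the cache lines they fight over.

```cpp
#include <atomic>

// Test-and-set spinlock: every waiter issues read-modify-write operations on
// the same cache line, which generates heavy coherence traffic under contention.
class TasLock {
public:
    void lock()   { while (flag_.exchange(true, std::memory_order_acquire)) { /* spin */ } }
    void unlock() { flag_.store(false, std::memory_order_release); }
private:
    std::atomic<bool> flag_{false};
};

// Ticket lock: FIFO hand-off; waiters mostly spin on reads of `serving_`, so
// the cache-line traffic pattern differs (often better under contention, but
// not necessarily better on every machine or workload).
class TicketLock {
public:
    void lock() {
        const unsigned my_ticket = next_.fetch_add(1, std::memory_order_relaxed);
        while (serving_.load(std::memory_order_acquire) != my_ticket) { /* spin */ }
    }
    void unlock() { serving_.fetch_add(1, std::memory_order_release); }
private:
    std::atomic<unsigned> next_{0};
    std::atomic<unsigned> serving_{0};
};
```

The ticket lock spins mostly on reads while the test-and-set lock issues read-modify-writes on every attempt, and how those patterns behave on a given cache-coherence protocol is exactly the hardware dependence the paper measures across its nine schemes.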

Abstract:

This paper presents the most exhaustive study of synchronization to date. We span multiple layers, from hardware cache-coherence protocols up to high-level concurrent software. We do so on different types of architectures, from single-socket – uniform and non-uniform – to multi-socket – directory and broadcast-based – many-cores. We draw a set of observations that, roughly speaking, imply that scalability of synchronization is mainly a property of the hardware.

Some Findings:


Guest post by Thierry Schellenbach, Founder/CTO of Fashiolista.com. Follow @tschellenbach on Twitter and GitHub.

Fashiolista started out as a hobby project which we built on the side. We had absolutely no idea it would grow into one of the largest online fashion communities. The entire first version took about two weeks to develop and our feed implementation was dead simple. We’ve come a long way since then and I’d like to share our experience with scaling feed systems.

Feeds are a core component of many large startups such as Pinterest, Instagram, Wanelo and Fashiolista. At Fashiolista the feed system powers the flat feed, aggregated feed and the notification system. This article will explain the troubles we ran into when scaling our feeds and the design decisions involved with building your own solution. Understanding the basics of how these feed systems work is essential as more and more applications rely on them.

Furthermore, we’ve open-sourced Feedly, the Python module powering our feeds. Where applicable I’ll reference how to use it to quickly build your own feed solution.
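To give a flavor of the core design decision (push on write versus pull on read), here is a minimal illustrative sketch of fan-out-on-write in C++. The types and the in-memory store are hypothetical stand-ins, not Feedly's API.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory stand-in for a feed store; in production the
// per-user feeds would typically live in something like Redis or Cassandra.
struct Activity {
    std::int64_t actor;
    std::string  verb;    // e.g. "loved"
    std::int64_t object;  // e.g. an item id
};

class FeedService {
public:
    void Follow(std::int64_t follower, std::int64_t target) {
        followers_[target].push_back(follower);
    }

    // Fan-out on write: push the activity into every follower's feed at write
    // time, so reading a feed later is a single cheap lookup.
    void AddActivity(const Activity& a) {
        for (std::int64_t follower : followers_[a.actor]) {
            auto& feed = feeds_[follower];
            feed.push_front(a);                          // newest first
            if (feed.size() > kMaxFeedLength) feed.pop_back();
        }
    }

    // Reading is O(page size): no joins or scatter-gather at read time.
    std::vector<Activity> GetFeed(std::int64_t user, std::size_t limit = 25) const {
        std::vector<Activity> page;
        auto it = feeds_.find(user);
        if (it == feeds_.end()) return page;
        for (const Activity& a : it->second) {
            if (page.size() == limit) break;
            page.push_back(a);
        }
        return page;
    }

private:
    static constexpr std::size_t kMaxFeedLength = 1000;  // cap storage per user
    std::unordered_map<std::int64_t, std::vector<std::int64_t>> followers_;
    std::unordered_map<std::int64_t, std::deque<Activity>>      feeds_;
};

int main() {
    FeedService feeds;
    feeds.Follow(/*follower=*/2, /*target=*/1);
    feeds.AddActivity({/*actor=*/1, "loved", /*object=*/42});
    return feeds.GetFeed(2).size() == 1 ? 0 : 1;   // user 2 sees user 1's activity
}
```

The catch is write amplification: an author with a million followers triggers a million feed writes, which is exactly the kind of scaling trouble the rest of this article digs into.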


The paper MegaPipe: A New Programming Interface for Scalable Network I/O (video, slides) hits the common theme that if you want to go faster you need a better car design, not just a better driver. That's why the authors started with a clean slate and designed a network API from the ground up with support for concurrent I/O, a requirement for achieving high performance while scaling to large numbers of connections per thread, multiple cores, etc. What they created is MegaPipe, "a new network programming API for message-oriented workloads to avoid the performance issues of BSD Socket API."
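For context, the baseline being measured against is the familiar one-syscall-per-operation event loop over the BSD socket API, sketched below. This is an illustrative baseline, not MegaPipe's API.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // A minimal epoll echo loop over the plain BSD socket API. Every accept,
    // read, and write below is a separate system call on a per-connection
    // file descriptor; that per-message kernel crossing is the overhead a
    // batched, message-oriented API tries to amortize.
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(listener, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    if (bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
        listen(listener, SOMAXCONN) != 0) {
        perror("bind/listen");
        return 1;
    }

    int ep = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = listener;
    epoll_ctl(ep, EPOLL_CTL_ADD, listener, &ev);      // syscall per registration

    epoll_event events[64];
    char buf[4096];
    for (;;) {
        int ready = epoll_wait(ep, events, 64, -1);   // syscall per wakeup
        for (int i = 0; i < ready; ++i) {
            int fd = events[i].data.fd;
            if (fd == listener) {
                int conn = accept(listener, nullptr, nullptr);   // syscall per connection
                epoll_event cev{};
                cev.events = EPOLLIN;
                cev.data.fd = conn;
                epoll_ctl(ep, EPOLL_CTL_ADD, conn, &cev);
            } else {
                ssize_t len = read(fd, buf, sizeof(buf));        // syscall per message
                if (len <= 0) { close(fd); continue; }
                write(fd, buf, static_cast<size_t>(len));        // syscall per reply
            }
        }
    }
}
```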

The result: MegaPipe outperforms baseline Linux between 29% (for long connections) and 582% (for short connections). MegaPipe improves the performance of a modified version of memcached between 15% and 320%. For a workload based on real-world HTTP traces, MegaPipe boosts the throughput of nginx by 75%.

What's this most excellent and interesting paper about?


In the moral realm there may be 7 deadly sins, but scalability maven Sean Hull has come up with Five More Things Deadly to Scalability that, when added to his earlier 5 Things That are Toxic to Scalability, make for a numerologically satisfying 10 sins against scalability:

  1. Slow Disk I/O – RAID 5 – Multi-tenant EBS. Use RAID 10; it provides good protection along with good read and write performance. The design of RAID 5 means poor performance and long repair times on failure. On AWS consider Provisioned IOPS as a way around IO bottlenecks.
  2. Using the database for Queuing. The database may seem like the perfect place to keep work queues, but under load locking and scanning overhead kills performance. Use specialized products like RabbitMQ and SQS to remove this bottleneck.
  3. Using Database for full-text searching. Search seems like another perfect database feature. At scale search doesn't perform well. Use specialized technologies like Solr or Sphinx.
  4. Insufficient Caching at all layers. Use memcache between your application and the database. Use a page cache like Varnish between users and your webserver. Select proper caching options for your HTML assets.
  5. Too much technical debt. Rewrite problem code instead of continually paying an implementation tax for poorly written code. In the long run it pays off.
  6. Object Relational Mappers. They create complex queries that are hard to optimize and tweak.
  7. Synchronous, Serial, Coupled or Locking Processes. Locks are like stop signs; traffic circles keep the traffic flowing. Row-level locking is better than table-level locking. Use async replication. Use eventual consistency for clusters.
  8. One Copy of Your Database. A single database server is a choke point. Create parallel databases and let a driver select between them.
  9. Having No Metrics. Visualize what's happening to your system using one of the many monitoring packages.
  10. Lack of Feature Flags. Be able to turn features off via a flag so that load can be shed when a traffic spike hits (see the sketch after this list).
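For that last item, a feature flag can be as small as a shared map that expensive code paths consult before running. A hypothetical minimal version (the flag names and call sites are made up):

```cpp
#include <atomic>
#include <initializer_list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical minimal feature-flag registry. All flags are registered up
// front so later lookups never race with a rehash; flipping a flag is just
// an atomic store, cheap enough to check on every request.
class FeatureFlags {
public:
    explicit FeatureFlags(std::initializer_list<std::pair<std::string, bool>> init) {
        for (const auto& kv : init) flags_[kv.first].store(kv.second);
    }

    bool Enabled(const std::string& name) const {
        auto it = flags_.find(name);
        return it != flags_.end() && it->second.load(std::memory_order_relaxed);
    }

    // Flip a registered flag at runtime (e.g. from an ops console) to shed
    // load during a spike without deploying anything.
    void Set(const std::string& name, bool on) {
        auto it = flags_.find(name);
        if (it != flags_.end()) it->second.store(on, std::memory_order_relaxed);
    }

private:
    std::unordered_map<std::string, std::atomic<bool>> flags_;
};

// Usage sketch (RenderRecommendations is a made-up stand-in for any
// non-essential, expensive feature):
//
//   FeatureFlags flags({{"recommendations", true}});
//   if (flags.Enabled("recommendations")) RenderRecommendations(page);
//   ...
//   flags.Set("recommendations", false);   // spike hits: shed the load
```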

It's not often you get so enthusiastic a recommendation for a paper as Sergio Bossa gives Memory Barriers: a Hardware View for Software Hackers: "If you only want to read one piece about CPU architecture, cache coherency and memory barriers, make it this one."

It is a clear and well written article. It even has a quiz. What's it about?

So what possessed CPU designers to cause them to inflict memory barriers on poor unsuspecting SMP software designers?

In short, because reordering memory references allows much better performance, and so memory barriers are needed to force ordering in things like synchronization primitives whose correct operation depends on ordered memory references.

Getting a more detailed answer to this question requires a good understanding of how CPU caches work, and especially what is required to make caches really work well. The following sections:

  1. present the structure of a cache,
  2. describe how cache-coherency protocols ensure that CPUs agree on the value of each location in memory, and, finally,
  3. outline how store buffers and invalidate queues help caches and cache-coherency protocols achieve high performance.

We will see that memory barriers are a necessary evil that is required to enable good performance and scalability, an evil that stems from the fact that CPUs are orders of magnitude faster than are both the interconnects between them and the memory they are attempting to access.
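To make the hazard concrete, here is a small demonstration using C++11 atomics rather than the article's raw barrier primitives. With relaxed ordering, store buffering lets both threads read the other's variable as 0; a full fence between each store and load rules that outcome out.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread_a() {
    x.store(1, std::memory_order_relaxed);
    // Without a barrier here the store to x may still sit in this core's
    // store buffer when the load of y executes, so the load can observe 0.
    // Uncommenting the fence (in both threads) forbids the r1 == r2 == 0 outcome:
    // std::atomic_thread_fence(std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_relaxed);
}

void thread_b() {
    y.store(1, std::memory_order_relaxed);
    // std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    int both_zero = 0;
    for (int i = 0; i < 50000; ++i) {
        x = 0; y = 0; r1 = -1; r2 = -1;
        std::thread a(thread_a), b(thread_b);
        a.join(); b.join();
        if (r1 == 0 && r2 == 0) ++both_zero;   // only possible via reordering
    }
    // On x86 this comes from store-buffer (StoreLoad) reordering; weaker
    // architectures permit even more reorderings without barriers.
    std::printf("both loads saw 0 in %d of 50000 runs\n", both_zero);
}
```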


Now that we have the C10K concurrent connection problem licked, how do we level up and support 10 million concurrent connections? Impossible you say. Nope, systems right now are delivering 10 million concurrent connections using techniques that are as radical as they may be unfamiliar.

To learn how it’s done we turn to Robert Graham, CEO of Errata Security, and his absolutely fantastic talk at Shmoocon 2013 called C10M Defending The Internet At Scale.

Robert has a brilliant way of framing the problem that I’ve never heard before. He starts with a little bit of history, relating how Unix wasn’t originally designed to be a general server OS; it was designed to be a control system for a telephone network. It was the telephone network that actually transported the data, so there was a clean separation between the control plane and the data plane. The problem is that we now use Unix servers as part of the data plane, which we shouldn’t do at all. If we were designing a kernel for handling one application per server, we would design it very differently than for a multi-user kernel.

Which is why he says the key is to understand:

  • The kernel isn’t the solution. The kernel is the problem.

Which means:

  • Don’t let the kernel do all the heavy lifting. Take packet handling, memory management, and processor scheduling out of the kernel and put it into the application, where it can be done efficiently. Let Linux handle the control plane and let the application handle the data plane.

The result will be a system that can handle 10 million concurrent connections with 200 clock cycles for packet handling and 1,400 clock cycles for application logic. As a main memory access costs 300 clock cycles, it’s key to design in a way that minimizes code and cache misses.

With a data plane oriented system you can process 10 million packets per second. With a control plane oriented system you only get 1 million packets per second.

If this seems extreme keep in mind the old saying: scalability is specialization. To do something great you can’t outsource performance to the OS. You have to do it yourself.
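In that do-it-yourself spirit, two of the simplest building blocks are pinning data-plane threads to dedicated cores and backing buffers with huge pages, so the kernel's scheduler and paging machinery stay out of the hot path. A minimal Linux-specific sketch (an illustration, not code from the talk):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for CPU_SET and pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <thread>

// Pin the calling thread to one core so the kernel scheduler never migrates
// it: the core effectively belongs to the data plane.
static bool PinToCore(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

// Back the packet buffers with a 2 MB huge page to cut TLB misses; fall back
// to normal pages if none are reserved (echo 64 > /proc/sys/vm/nr_hugepages).
static void* AllocBuffers(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)
        p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

int main() {
    std::thread data_plane([] {
        if (!PinToCore(1)) std::perror("pthread_setaffinity_np");
        constexpr std::size_t kBytes = 2 * 1024 * 1024;
        void* buffers = AllocBuffers(kBytes);
        if (buffers == nullptr) { std::perror("mmap"); return; }
        std::memset(buffers, 0, kBytes);   // touch the memory up front
        // ... poll the NIC and process packets here without entering the kernel ...
    });
    data_plane.join();
}
```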

Now, let’s learn how Robert creates a system capable of handling 10 million concurrent connections...


New submitter jhantin writes "The Bitcoin blockchain has forked due to a lurking backward-compatibility issue: versions older than 0.8 do not properly handle blocks larger than about 500k, and Slush's pool mined a 974k block today. The problem is that not all mining operations are on 0.8; blocks are being generated by a mix of several different versions of the daemon, each making its own decision as to which of the two forks is preferable to extend, and older versions refuse to honor or extend from a block of this size. The consensus on #bitcoin-dev is damage control: miners need to mine on pre-0.8 code so the backward-compatible fork will outgrow and thus dominate the compatibility-breaking one; merchants need to stop accepting transactions until the network re-converges on the backward-compatible fork of the chain; and average users can ignore the warning that they are out of sync and need to upgrade." It turns out there's an approximately 512K limit on atomic updates in Berkeley DB, which was used by versions prior to 0.8; 0.8 uses a new database, allowing it to create blocks that older versions won't accept.


