
Parallel computing


Ever wonder what powers Google's world-spirit-sensing Zeitgeist service? No, it's not a homunculus of Georg Wilhelm Friedrich Hegel sitting in each browser. It's actually a stream-processing system (think streaming MapReduce on steroids) called MillWheel, described in this very well written paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. MillWheel isn't just used for Zeitgeist at Google; it also powers streaming joins for a variety of Ads customers, a generalized anomaly-detection service, and network switch and cluster health monitoring.

Abstract:

MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework’s fault-tolerance guarantees.


This paper describes MillWheel’s programming model as well as its implementation. The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel’s features are used. MillWheel’s programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. In practice, we find that MillWheel’s unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
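To make the programming model concrete, here is a minimal C++ sketch of a per-key windowed counter in the ProcessRecord/ProcessTimer style the paper describes. The hook names follow the paper, but the types and the in-memory stand-ins for MillWheel's persistent state and framework-managed timers are simplifications invented for illustration:

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for framework types; MillWheel's actual
// interfaces differ in detail.
struct Record {
    std::string key;    // records are keyed; state and timers are per-key
    int64_t timestamp;  // logical time, tracked via low watermarks
};

struct Timer {
    std::string key;
    int64_t fire_time;  // fires once the low watermark passes this point
};

// Counts records per key in one-minute windows of logical time.
class WindowedCounter {
public:
    // Invoked once per input record: bump the window's count and ask the
    // framework to wake us when the window can no longer receive records.
    void ProcessRecord(const Record& rec) {
        int64_t window = rec.timestamp - rec.timestamp % 60000;
        state_[rec.key + "@" + std::to_string(window)] += 1;
        pending_timers_.push_back({rec.key, window + 60000});
    }

    // Invoked when the low watermark guarantees no earlier records remain
    // in flight, so the window's count is final and safe to emit.
    void ProcessTimer(const Timer& t) {
        int64_t window = t.fire_time - 60000;
        std::cout << t.key << " window " << window << " count "
                  << state_[t.key + "@" + std::to_string(window)] << "\n";
    }

private:
    // In MillWheel both of these live in replicated, persistent storage.
    std::map<std::string, int64_t> state_;
    std::vector<Timer> pending_timers_;
};
```

In the real system the state writes, timer firings, and record deliveries all sit inside the framework's fault-tolerance envelope, which is what lets user code stay this small.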


Awesome paper on how particular synchronization mechanisms scale on multi-core architectures: Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask.

The goal is to pick a locking approach that doesn't degrade as the number of cores increases. Like everything else in life, that doesn't appear to be generically possible:

None of the nine locking schemes we consider consistently outperforms any other one, on all target architectures or workloads. Strictly speaking, to seek optimality, a lock algorithm should thus be selected based on the hardware platform and the expected workload.
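To see what that degradation looks like on your own hardware, here is a minimal sketch of the kind of contention microbenchmark such studies run. It only exercises std::mutex, whereas the paper carefully pins threads, controls NUMA placement, and compares nine lock algorithms:

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex mtx;     // the lock under test; swap in other schemes to compare
long counter = 0;   // shared data protected by the lock

// Run nthreads threads hammering the lock for `dur`, return total ops.
long run(int nthreads, std::chrono::milliseconds dur) {
    std::atomic<bool> stop{false};
    std::atomic<long> ops{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < nthreads; ++i) {
        threads.emplace_back([&] {
            long local = 0;
            while (!stop.load(std::memory_order_relaxed)) {
                std::lock_guard<std::mutex> g(mtx);  // critical section
                ++counter;
                ++local;
            }
            ops += local;
        });
    }
    std::this_thread::sleep_for(dur);
    stop = true;
    for (auto& t : threads) t.join();
    return ops;
}

int main() {
    for (int n : {1, 2, 4, 8, 16}) {
        counter = 0;
        long ops = run(n, std::chrono::milliseconds(500));
        std::cout << n << " threads: " << ops * 2 << " ops/sec\n";  // 500 ms runs
    }
}
```

Swapping the std::mutex for a spinlock or a ticket lock and re-running across thread counts is exactly the experiment whose winner, per the paper, changes with the hardware.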

Abstract:

This paper presents the most exhaustive study of synchronization to date. We span multiple layers, from hardware cache-coherence protocols up to high-level concurrent software. We do so on different types of architectures, from single-socket – uniform and non-uniform – to multi-socket – directory and broadcast-based – many-cores. We draw a set of observations that, roughly speaking, imply that scalability of synchronization is mainly a property of the hardware.



This is a guest post by Yelp's Jim Blomo. Jim manages a growing data mining team that uses Hadoop, mrjob, and oddjob to process TBs of data. Before Yelp, he built infrastructure for startups and Amazon. Check out his upcoming talk at OSCON 2013 on Building a Cloud Culture at Yelp.

In Q1 2013, Yelp had 102 million unique visitors (source: Google Analytics), including approximately 10 million unique mobile devices using the Yelp app on a monthly average basis. Yelpers have written more than 39 million rich, local reviews, making Yelp the leading local guide on everything from boutiques and mechanics to restaurants and dentists. One of the most distinctive things about Yelp's data is its variety: reviews, user profiles, business descriptions, menus, check-ins, food photos... the list goes on. We have many ways to deal with data, but today I’ll focus on how we handle offline data processing and analytics.

In late 2009, Yelp investigated using Amazon’s Elastic MapReduce (EMR) as an alternative to an in-house cluster built from spare computers.  By mid 2010, we had moved production processing completely to EMR and turned off our Hadoop cluster.  Today we run over 500 jobs a day, from integration tests to advertising metrics.  We’ve learned a few lessons along the way that can hopefully benefit you as well.



It's not often you get so enthusiastic a recommendation for a paper as Sergio Bossa gives Memory Barriers: a Hardware View for Software Hackers: "If you only want to read one piece about CPU architecture, cache coherency and memory barriers, make it this one."

It is a clear and well written article. It even has a quiz. What's it about?

So what possessed CPU designers to cause them to inflict memory barriers on poor unsuspecting SMP software designers?

In short, because reordering memory references allows much better performance, and so memory barriers are needed to force ordering in things like synchronization primitives whose correct operation depends on ordered memory references.

Getting a more detailed answer to this question requires a good understanding of how CPU caches work, and especially what is required to make caches really work well. The following sections:

  1. present the structure of a cache,
  2. describe how cache-coherency protocols ensure that CPUs agree on the value of each location in memory, and, finally,
  3. outline how store buffers and invalidate queues help caches and cache-coherency protocols achieve high performance.

We will see that memory barriers are a necessary evil that is required to enable good performance and scalability, an evil that stems from the fact that CPUs are orders of magnitude faster than are both the interconnects between them and the memory they are attempting to access.
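One way to watch that reordering from user space is the classic store-buffering litmus test. This C++11 sketch is my own illustration, not code from the article: with relaxed atomics, each thread's store can still be sitting in its CPU's store buffer when the other thread's load executes, so both loads can return zero.

```cpp
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;  // each written by one thread, read after join

int main() {
    int both_zero = 0;
    for (int i = 0; i < 100000; ++i) {
        x = 0; y = 0;
        // Thread 1: store to x, then load y.
        std::thread t1([] {
            x.store(1, std::memory_order_relaxed);
            r1 = y.load(std::memory_order_relaxed);
        });
        // Thread 2: store to y, then load x.
        std::thread t2([] {
            y.store(1, std::memory_order_relaxed);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();
        // Impossible under sequential consistency; allowed when each
        // store is still buffered while the other CPU loads.
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    std::cout << "both reads saw 0 in " << both_zero << " of 100000 runs\n";
}
```

Changing both the stores and the loads to std::memory_order_seq_cst inserts the full barriers the article describes, and the both-zero outcome disappears.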


AMD wants to talk about Heterogeneous Systems Architecture (HSA), its vision for the future of system architectures. To that end, it held a press conference last week to discuss what it's calling "heterogeneous Uniform Memory Access" (hUMA). The company outlined what it was doing, and why, both confirming and reaffirming the things it has been saying for the last couple of years.

The central HSA concept is that systems will have multiple different kinds of processors, connected together and operating as peers. The two main kinds of processors are conventional, versatile CPUs and the more specialized GPUs.

Modern GPUs have enormous parallel arithmetic power, especially floating point arithmetic, but are poorly suited to single-threaded code with lots of branches. Modern CPUs are well suited to single-threaded code with lots of branches, but less well suited to massively parallel number crunching. Splitting workloads between a CPU and a GPU, using each for the workloads it's good at, has driven the development of general-purpose GPU (GPGPU) software.




[Image: IBM's integrated silicon nanophotonics transceiver on a chip; optical waveguides highlighted in blue, the copper conductors of the electronic components in yellow. Credit: IBM Research]

IBM has developed a technology that combines optical communications and electronics in silicon, allowing optical interconnects to be built directly into integrated circuits on a chip. That technology, called silicon nanophotonics, is now moving out of the labs and is ready to become a product. It could potentially revolutionize how processors, memory, and storage in supercomputers and data centers interconnect.

Silicon nanophotonics was first demonstrated by IBM in 2010 as part of IBM Research's efforts to build Blue Waters, the NCSA supercomputer project that the company withdrew from in 2011. But IBM Research continued to develop the technology, and today announced that it was ready for mass production. For the first time, the technology "has been verified and manufactured in a 90-nanometer CMOS commercial foundry," Dr. Solomon Assefa, Nanophotonics Scientist for IBM Research, told Ars.

A single CMOS-based nanophotonic transceiver can convert data between electrical and optical form with virtually no latency, handling a data connection of more than 25 gigabits per second. Depending on the application, hundreds of transceivers could be integrated into a single CMOS chip, pushing terabits of data over fiber-optic connections between processors, memory, and storage systems at distances ranging from two centimeters to two kilometers.
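Back-of-the-envelope, using the figures above: at 25 Gb/s per transceiver, 40 transceivers aggregate to 1 Tb/s, so a few hundred on one chip lands squarely in the multi-terabit range IBM describes.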

