Ever wonder what powers Google's world-spirit-sensing Zeitgeist service? No, it's not a homunculus of Georg Wilhelm Friedrich Hegel sitting in each browser. It's actually a stream-processing system (think streaming MapReduce on steroids) called MillWheel, described in this very well-written paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. MillWheel isn't just used for Zeitgeist at Google; it also powers streaming joins for a variety of Ads customers, a generalized anomaly-detection service, and network-switch and cluster-health monitoring.
Abstract:
MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework’s fault-tolerance guarantees.
This paper describes MillWheel’s programming model as well as its implementation. The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel’s features are used. MillWheel’s programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. In practice, we find that MillWheel’s unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
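To make the programming model concrete, here is a rough Python sketch of a MillWheel-style computation: a per-key counter over one-minute windows of logical time. The paper describes computations in terms of ProcessRecord and ProcessTimer hooks; everything else below (the dict-backed state, the timer list, the emit callback) is an invented stand-in so the example runs on its own, not MillWheel's actual API.

```python
# Illustrative sketch only: MillWheel itself is a C++ framework at Google.
# The dict state, timer list, and emit callback are invented stand-ins for
# the durable state and timer services the framework provides.
WINDOW_SECONDS = 60


class WindowedCounter:
    """Counts records per key per one-minute window of logical (event) time."""

    def __init__(self, emit):
        self.state = {}           # stands in for MillWheel's durable per-key state
        self.pending_timers = []  # (key, fire_time) wake-ups requested by the code
        self.emit = emit          # downstream edge of the computation graph

    def process_record(self, key, timestamp):
        # Bucket by the record's own timestamp (logical time), not wall clock.
        window_start = timestamp - (timestamp % WINDOW_SECONDS)
        counts = self.state.setdefault(key, {})
        counts[window_start] = counts.get(window_start, 0) + 1
        # Ask to be woken once the low watermark passes the window end.
        self.pending_timers.append((key, window_start + WINDOW_SECONDS))

    def process_timer(self, key, fire_time):
        # The watermark guarantees no older records will arrive, so the
        # window total is final and can be produced downstream.
        window_start = fire_time - WINDOW_SECONDS
        total = self.state.get(key, {}).pop(window_start, 0)
        self.emit(key, window_start, total)


# Drive it by hand the way the framework would.
counter = WindowedCounter(emit=lambda k, w, n: print(k, w, n))
for ts in (0, 10, 59, 61):
    counter.process_record("query:millwheel", ts)
counter.process_timer("query:millwheel", 60)  # prints: query:millwheel 0 3
```

The interesting part is that the aggregation is keyed to logical time and released by the watermark, which is what lets MillWheel give exactly-once, time-based results despite failures and reordering.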
Guest post by Thierry Schellenbach, Founder/CTO of Fashiolista.com. Follow @tschellenbach on Twitter and GitHub.
Fashiolista started out as a hobby project which we built on the side. We had absolutely no idea it would grow into one of the largest online fashion communities. The entire first version took about two weeks to develop and our feed implementation was dead simple. We’ve come a long way since then and I’d like to share our experience with scaling feed systems.
Feeds are a core component at many large startups, such as Pinterest, Instagram, Wanelo, and Fashiolista. At Fashiolista, the feed system powers the flat feed, the aggregated feed, and the notification system. This article explains the troubles we ran into when scaling our feeds and the design decisions involved in building your own solution. Understanding the basics of how these feed systems work is essential, as more and more applications rely on them.
Furthermore, we’ve open-sourced Feedly, the Python module powering our feeds. Where applicable, I’ll show how to use it to quickly build your own feed solution.
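To give a flavor of the push (fan-out-on-write) approach discussed later on, here is a minimal sketch of a flat feed on top of Redis. Feedly does all of this (and much more) for you; the key layout and the get_follower_ids helper below are illustrative assumptions, not Feedly's API.

```python
# Minimal fan-out-on-write sketch of a flat feed on Redis. The key layout and
# get_follower_ids() are assumptions for illustration only.
import time

import redis

r = redis.Redis(decode_responses=True)
MAX_FEED_LENGTH = 1000  # keep each feed bounded so memory stays predictable


def get_follower_ids(user_id):
    """Hypothetical follower lookup; in practice this comes from your database."""
    return r.smembers(f"followers:{user_id}")


def add_activity(user_id, activity_id):
    """Push a new activity to every follower's feed (fan-out on write)."""
    now = time.time()
    pipe = r.pipeline()
    for follower_id in get_follower_ids(user_id):
        key = f"feed:{follower_id}"
        pipe.zadd(key, {activity_id: now})                   # sorted set keyed by time
        pipe.zremrangebyrank(key, 0, -MAX_FEED_LENGTH - 1)   # trim the oldest items
    pipe.execute()


def read_feed(user_id, limit=25):
    """Reading is a single cheap range query, which is the point of fan-out."""
    return r.zrevrange(f"feed:{user_id}", 0, limit - 1)
```

The write is more expensive because it touches every follower, but reads become a single cheap range query, which is exactly the trade-off a feed system makes.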
- Apache Cassandra
- API
- Cross-platform software
- Data management
- Database management systems
- database solution
- Distributed computing architecture
- Fashiolista.com
- feed systems
- GitHub
- graph-database backed feed algorithm
- Graphity algorithm
- NoSQL
- online fashion communities
- PostgreSQL
- Python
- RAM
- real live applications
- Redis
- Rene Pickhardt
- Scalability
- Sentinel
- Shard
- Structured storage
- Technology
- Thierry Schellenbach
- Twemproxy
- using Twemproxy
- Wanelo
Jeff Dean talk on large distributed systems at the University of Minnesota (unite.umn.edu)
submitted 1 month ago by aterlumen
- ACM-Infosys Foundation
- BigTable
- Computing
- Concurrent computing
- Distributed computing architecture
- distributed computing infrastructure
- flash
- Jeff
- large distributed systems
- large-scale distributed systems
- large-scale web services
- machine learning
- machine translation
- MapReduce
- Minnesota
- neural networks
- Parallel computing
- search parameters
- speech recognition
- Technology
- the ACM-Infosys Foundation Award
- the Mark Weiser Award
- U.S. National Academy of Engineering
This is a guest post by Yelp's Jim Blomo. Jim manages a growing data mining team that uses Hadoop, mrjob, and oddjob to process TBs of data. Before Yelp, he built infrastructure for startups and Amazon. Check out his upcoming talk at OSCON 2013 on Building a Cloud Culture at Yelp.
In Q1 2013, Yelp had 102 million unique visitors (source: Google Analytics), including approximately 10 million unique mobile devices using the Yelp app on a monthly average basis. Yelpers have written more than 39 million rich, local reviews, making Yelp the leading local guide on everything from boutiques and mechanics to restaurants and dentists. One of the most distinctive things about Yelp's data is its variety: reviews, user profiles, business descriptions, menus, check-ins, food photos... the list goes on. We have many ways to deal with that data, but today I’ll focus on how we handle offline data processing and analytics.
In late 2009, Yelp investigated using Amazon’s Elastic MapReduce (EMR) as an alternative to an in-house cluster built from spare computers. By mid-2010, we had moved production processing completely to EMR and turned off our Hadoop cluster. Today we run over 500 jobs a day, from integration tests to advertising metrics. We’ve learned a few lessons along the way that can hopefully benefit you as well.
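To give a sense of how little code one of those jobs takes, here is a minimal mrjob job in the spirit of the ones we run on EMR. The input format (plain text, one review per line) is just an assumption for the example.

```python
# A minimal mrjob job: count word occurrences across review text.
# Input is assumed to be plain text, one review per line.
from mrjob.job import MRJob


class MRReviewWordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every word in the line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the partial counts for each word.
        yield word, sum(counts)


if __name__ == "__main__":
    MRReviewWordCount.run()
```

Run it locally with python mr_review_word_count.py reviews.txt, or on EMR by adding -r emr; being able to run the same script in both places is a big part of mrjob's appeal.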
Job Flow Pooling
- adaptive infrastructure
- advertising metrics
- Amazon
- Apache Hadoop
- cloud
- Cloud computing
- Cloud infrastructure
- cloud solution
- cloud technology
- Computer cluster
- Computing
- Concurrent computing
- data
- data mining
- data mining team
- Distributed computing architecture
- Elastic MapReduce
- Example
- food photos
- Hadoop
- in-house solution
- Jim Blomo
- Job Flow
- MapReduce
- mobile devices
- offline batch processing
- offline data processing
- Parallel computing
- production processing
- Python
- SOA
- US West
- using Amazon’s Elastic MapReduce
- Airbnb
- Amazon
- Amazon Elastic Compute Cloud
- Amazon Relational Database Service
- API
- Arizona
- Asia
- Cloud computing
- Cloud storage
- Computing
- Cross-platform software
- data backup solution
- Data management
- Distributed computing architecture
- document search
- ec2 tools
- Java
- mailed media
- MySQL
- RAM
- read-intensive applications
- read-only slave server
- relational database
- Relational Database Service
- Replication
- Shard
- Technology
- web service
- Web services
Unity's networking follows a client-server design, but a peer-to-peer system can be built on top of Unity's existing networking framework. In this article, I'll demonstrate how to do server discovery in Unity.
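The article implements this with Unity's networking APIs, but the mechanism underneath is plain UDP broadcast on the LAN. Here is a language-agnostic Python sketch of the same idea; the port number and the probe/reply strings are arbitrary placeholders, not anything Unity defines.

```python
# Sketch of LAN server discovery via UDP broadcast. The port and message
# strings are arbitrary choices for illustration.
import socket

DISCOVERY_PORT = 47777
PROBE = b"GAME_DISCOVERY_PROBE"
REPLY = b"GAME_DISCOVERY_REPLY"


def run_server():
    """Game host: answer every probe so clients learn our address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", DISCOVERY_PORT))
    while True:
        data, addr = sock.recvfrom(1024)
        if data == PROBE:
            sock.sendto(REPLY, addr)


def discover_servers(timeout=2.0):
    """Client: broadcast a probe and collect the addresses that answer."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.settimeout(timeout)
    sock.sendto(PROBE, ("255.255.255.255", DISCOVERY_PORT))
    servers = []
    try:
        while True:
            data, addr = sock.recvfrom(1024)
            if data == REPLY:
                servers.append(addr[0])
    except socket.timeout:
        return servers
```

Run run_server() in the game host's process and discover_servers() on each client; every client ends up with the list of LAN addresses that answered the probe.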
- client server
- client-server networking model
- Client–server model
- Computing
- Direct Client-to-Client
- Distributed computing architecture
- LAN
- lower-level networking work
- Mobile software development
- Peer-to-peer
- peer-to-peer system
- Programming Indie
- Server
- Software
- Technology
- Transmission Control Protocol
- Windows Server
IBM's integrated silicon nanophotonics transceiver on a chip; the optical waveguides are highlighted in blue, and the copper conductors of the electronic components in yellow. (Image: IBM Research)
IBM has developed a technology that integrates optical communications and electronics in silicon, allowing optical interconnects to be integrated directly with integrated circuits in a chip. That technology, called silicon nanophotonics, is now moving out of the labs and is ready to become a product. It could potentially revolutionize how processors, memory, and storage in supercomputers and data centers interconnect.
Silicon nanophotonics was first demonstrated by IBM in 2010 as part of IBM Research's efforts to build Blue Waters, the NCSA supercomputer project that the company withdrew from in 2011. But IBM Research continued to develop the technology, and today announced that it was ready for mass production. For the first time, the technology "has been verified and manufactured in a 90-nanometer CMOS commercial foundry," Dr. Solomon Assefa, Nanophotonics Scientist for IBM Research, told Ars.
A single CMOS-based nanophotonic transceiver can convert data between electrical and optical form with virtually no latency, handling a data connection of more than 25 gigabits per second. Depending on the application, hundreds of transceivers could be integrated into a single CMOS chip, pushing terabits of data over fiber-optic connections between processors, memory, and storage systems across distances ranging from two centimeters to two kilometers.
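As a quick sanity check on those numbers (treating "hundreds" as, say, 200 transceivers, an assumption rather than an IBM figure), the arithmetic does land in terabit territory:

```python
# Back-of-the-envelope aggregate bandwidth; 200 is an assumed stand-in
# for "hundreds" of transceivers, not a figure from IBM.
per_transceiver_gbps = 25
transceivers_per_chip = 200
aggregate_tbps = per_transceiver_gbps * transceivers_per_chip / 1000
print(aggregate_tbps)  # 5.0 terabits per second
```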
- 90-nanometer chips
- actual product
- Ars Technica
- Baltimore
- Cloud computing data centers
- CMOS
- CMOS chip
- Computing
- Concurrent computing
- Distributed computing architecture
- Electronics
- Hadoop
- IBM
- image processing
- information technology
- Integrated circuit
- integrated circuits
- interconnected networks
- mainframe systems
- Maryland
- Nanophotonics
- network systems integrator
- optic data
- optical communications
- Optics
- Parallel computing
- Photonics
- public cloud computing
- redundant storage networking
- Sean Gallagher
- Silicon photonics
- social networking data centers
- Solomon Assefa
- storage systems
- Supercomputer
- Technology
- Technology Lab
- universal technology
New submitter happyscientist writes "This is a nice 'Hello World' for using Hadoop MapReduce on Common Crawl data. I was interested when Common Crawl announced themselves a few weeks ago, but I was hesitant to dive in. This is a good video/example that makes it clear how easy it is to start playing with the crawl data."
Read more of this story at Slashdot.