Ever wonder what powers Google's world-spirit-sensing Zeitgeist service? No, it's not a homunculus of Georg Wilhelm Friedrich Hegel sitting in each browser. It's actually a stream-processing system (think streaming MapReduce on steroids) called MillWheel, described in this very well-written paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. MillWheel isn't just used for Zeitgeist at Google; it also powers streaming joins for a variety of Ads customers, a generalized anomaly-detection service, and network switch and cluster health monitoring.
Abstract:
MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework’s fault-tolerance guarantees.
This paper describes MillWheel’s programming model as well as its implementation. The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel’s features are used. MillWheel’s programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. In practice, we find that MillWheel’s unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
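To make the programming model concrete, here is a minimal, hypothetical Python analogue of a MillWheel computation that counts records per key per one-minute window of logical time. The real API is C++; the method names, the ctx object, and the toy driver below are illustrative assumptions, not MillWheel's actual interface.

```python
# A hypothetical sketch of a MillWheel-style computation: per-key persistent
# state plus a timer that fires once the low watermark passes a window's end.
from collections import defaultdict

WINDOW_SECS = 60  # assumed aggregation window

class PerKeyCounter:
    """Counts records per key per window of logical (record) time."""

    def __init__(self):
        # Stands in for MillWheel's checkpointed per-key persistent state.
        self.counts = defaultdict(int)

    def process_record(self, key, record, timestamp, ctx):
        window = timestamp - (timestamp % WINDOW_SECS)
        self.counts[(key, window)] += 1
        # Ask to be woken once the low watermark passes the window's end,
        # i.e. once records with earlier timestamps have (almost all) arrived.
        ctx.set_timer(key, fire_at=window + WINDOW_SECS)

    def process_timer(self, key, fire_at, ctx):
        window = fire_at - WINDOW_SECS
        total = self.counts.pop((key, window), 0)
        # Emit the completed per-window aggregate to downstream computations.
        ctx.produce(key, {"window_start": window, "count": total})

class FakeContext:
    """Toy stand-in for the framework; real MillWheel fires timers when the
    low watermark advances past fire_at."""
    def __init__(self):
        self.pending = None
    def set_timer(self, key, fire_at):
        self.pending = (key, fire_at)
    def produce(self, key, value):
        print("emit", key, value)

comp = PerKeyCounter()
ctx = FakeContext()
comp.process_record("gulf oil spill", {"query": "gulf oil spill"}, 125, ctx)
comp.process_record("gulf oil spill", {"query": "gulf oil spill"}, 130, ctx)
comp.process_timer(*ctx.pending, ctx)  # pretend the watermark has advanced
```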
Abstract:
This paper presents Polybase, a feature of SQL Server PDW V2 that allows users to manage and query data stored in a Hadoop cluster using the standard SQL query language. Unlike other database systems that provide only a relational view over HDFS-resident data through the use of an external table mechanism, Polybase employs a split query processing paradigm in which SQL operators on HDFS-resident data are translated into MapReduce jobs by the PDW query optimizer and then executed on the Hadoop cluster. The paper describes the design and implementation of Polybase along with a thorough performance evaluation that explores the benefits of employing a split query processing paradigm for executing queries that involve both structured data in a relational DBMS and unstructured data in Hadoop. Our results demonstrate that while the use of a split-based query execution paradigm can improve the performance of some queries by as much as 10X, one must employ a cost-based query optimizer that considers a broad set of factors when deciding whether or not it is advantageous to push a SQL operator to Hadoop. These factors include the selectivity factor of the predicate, the relative sizes of the two clusters, and whether or not their nodes are co-located. In addition, differences in the semantics of the Java and SQL languages must be carefully considered in order to avoid altering the expected results of a query.
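As a rough illustration of the cost-based decision the abstract argues for, the toy function below weighs importing all HDFS rows into PDW against pushing the predicate into a MapReduce job. The cost formulas, constants, and parameter names are assumptions for illustration, not the actual PDW optimizer.

```python
# A toy sketch of a cost-based pushdown decision of the kind described above.
def should_push_to_hadoop(selectivity, hdfs_rows, pdw_nodes, hadoop_nodes,
                          colocated):
    """Return True if translating the predicate into a MapReduce job looks
    cheaper than importing the whole HDFS table and filtering inside PDW."""
    rows_after_filter = hdfs_rows * selectivity

    # Plan A: import every HDFS row into PDW, then filter there.
    import_all_cost = hdfs_rows / pdw_nodes

    # Plan B: run a MapReduce job (fixed startup overhead), then import only
    # the qualifying rows; moving data costs more if the clusters are not
    # co-located.
    mr_startup = 30_000
    transfer_penalty = 1.0 if colocated else 3.0
    pushdown_cost = (mr_startup
                     + hdfs_rows / hadoop_nodes
                     + transfer_penalty * rows_after_filter / pdw_nodes)

    return pushdown_cost < import_all_cost

# A highly selective predicate over a large HDFS table favors pushdown:
print(should_push_to_hadoop(0.01, 10_000_000, pdw_nodes=16,
                            hadoop_nodes=48, colocated=True))
```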
Jeff Dean talk on large distributed systems at the University of Minnesota (unite.umn.edu)
This is a guest post by Yelp's Jim Blomo. Jim manages a growing data mining team that uses Hadoop, mrjob, and oddjob to process TBs of data. Before Yelp, he built infrastructure for startups and Amazon. Check out his upcoming talk at OSCON 2013 on Building a Cloud Culture at Yelp.
In Q1 2013, Yelp had 102 million unique visitors (source: Google Analytics), including approximately 10 million unique mobile devices using the Yelp app on a monthly average basis. Yelpers have written more than 39 million rich, local reviews, making Yelp the leading local guide on everything from boutiques and mechanics to restaurants and dentists. One of the most distinctive things about Yelp's data is its variety: reviews, user profiles, business descriptions, menus, check-ins, food photos... the list goes on. We have many ways to deal with data, but today I’ll focus on how we handle offline data processing and analytics.
In late 2009, Yelp investigated using Amazon’s Elastic MapReduce (EMR) as an alternative to an in-house cluster built from spare computers. By mid 2010, we had moved production processing completely to EMR and turned off our Hadoop cluster. Today we run over 500 jobs a day, from integration tests to advertising metrics. We’ve learned a few lessons along the way that can hopefully benefit you as well.
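For a flavor of what such a job looks like, here is a minimal mrjob word-count sketch; the task itself is just an illustration, not one of Yelp's production jobs. The same script can run locally, on a private Hadoop cluster, or on EMR (for example, `python wordcount.py -r emr input.txt`).

```python
# A minimal mrjob job: mappers emit (word, 1), reducers sum the counts.
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Each input line is split into words; emit (word, 1) pairs.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the counts for each word across all mappers.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```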
Google Ventures, the venture capital firm funded by Google, is building up its data science team to increase the capabilities inside its portfolio companies and to look for new investments in the area. The firm is extending a thesis developed inside Google about finding patterns in big collections of data, which it hopes will work in other industries.
New submitter rescrv writes "Key-value stores (like Cassandra, Redis and DynamoDB) have been replacing traditional databases in many demanding web applications (e.g. Twitter, Google, Facebook, LinkedIn, and others). But for the most part, the differences between existing NoSQL systems come down to the choice of well-studied implementation techniques; in particular, they all provide a similar API that achieves high performance and scalability by limiting applications to simple operations like GET and PUT.
HyperDex, a new key-value store developed at Cornell, stands out in the NoSQL spectrum with its unique design. HyperDex employs a multi-dimensional hash function to enable efficient search operations; that is, objects may be retrieved without using the key (PDF) under which they are stored. Other systems either employ indexing techniques to enable search or enumerate all objects in the system. In contrast, HyperDex's design enables applications to retrieve search results directly from servers in the system. The results are impressive. Preliminary benchmark results on the project website show that HyperDex provides significant performance improvements over Cassandra and MongoDB. With its unique design and impressive performance, it seems fitting to ask: Is HyperDex the start of NoSQL 2.0?"
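The sketch below illustrates the multi-dimensional hashing idea in simplified form: each attribute hashes to one coordinate, the coordinates jointly pick a server, and a search that fixes only some attributes needs to contact only the servers whose regions intersect that subspace. The grid layout, attribute names, and helper functions are illustrative assumptions, not HyperDex's implementation.

```python
# A simplified sketch of multi-dimensional ("hyperspace") hashing.
import hashlib
from itertools import product

GRID = 4  # assumed number of partitions per dimension

def coord(value):
    """Hash one attribute value to a coordinate along its dimension."""
    h = int(hashlib.sha1(str(value).encode()).hexdigest(), 16)
    return h % GRID

def server_for(obj, attrs):
    """An object's server is identified by the coordinates of its attributes."""
    return tuple(coord(obj[a]) for a in attrs)

def servers_for_search(query, attrs):
    """Servers that may hold objects matching a query on a subset of attributes."""
    ranges = [[coord(query[a])] if a in query else range(GRID) for a in attrs]
    return set(product(*ranges))

attrs = ("first_name", "last_name", "phone")
obj = {"first_name": "Ada", "last_name": "Lovelace", "phone": 5551234}
print(server_for(obj, attrs))
# A search on last_name alone touches GRID**2 = 16 of the GRID**3 = 64
# possible regions, instead of broadcasting to all of them.
print(len(servers_for_search({"last_name": "Lovelace"}, attrs)))
```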
The Web of Data is built upon two simple ideas: first, to employ the RDF data model to publish structured data on the Web; second, to set explicit RDF links between data items within different data sources. Background information about the Web of Data is found at the wiki pages of the W3C Linking Open Data community effort, in the overview article Linked Data - The Story So Far, and in the tutorial on How to publish Linked Data on the Web.
The Silk Link Discovery Framework supports data publishers in accomplishing the second task. Using the declarative Silk - Link Specification Language (Silk-LSL), developers can specify which types of RDF links should be discovered between data sources, as well as which conditions data items must fulfill in order to be interlinked. These link conditions may combine various similarity metrics and can take the graph around a data item into account, which is addressed using an RDF path language. Silk accesses the data sources that should be interlinked via the SPARQL protocol and can thus be used against local as well as remote SPARQL endpoints.
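As a rough Python analogue of such a link condition (Silk itself expresses this declaratively in Silk-LSL), the sketch below combines a string similarity metric on labels with a numeric similarity on populations and emits a link when the weighted score clears a threshold. The metrics, weights, and threshold are assumptions for illustration.

```python
# Illustrative analogue of a link condition: combine similarity metrics
# and link two data items if the aggregate score clears a threshold.
from difflib import SequenceMatcher

def string_sim(a, b):
    """Simple string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def num_sim(a, b, max_diff):
    """Numeric similarity: 1.0 for equal values, 0.0 beyond max_diff."""
    return max(0.0, 1.0 - abs(a - b) / max_diff)

def link_condition(city_a, city_b, threshold=0.9):
    # Weighted combination of a label comparison and a population comparison.
    score = (0.7 * string_sim(city_a["label"], city_b["label"])
             + 0.3 * num_sim(city_a["population"], city_b["population"],
                             max_diff=50_000))
    return score >= threshold  # if True, emit e.g. an owl:sameAs link

a = {"label": "Berlin", "population": 3_645_000}
b = {"label": "berlin", "population": 3_644_826}
print(link_condition(a, b))  # True: both descriptions refer to the same city
```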
Silk is provided in three different variants which address different use cases:
- Silk Single Machine is used to generate RDF links on a single machine. The datasets that should be interlinked can either reside on the same machine or on remote machines which are accessed via the SPARQL protocol. Silk Single Machine provides multithreading and caching. In addition, the performance is further enhanced using the MultiBlock blocking algorithm.
- Silk MapReduce is used to generate RDF links between data sets using a cluster of multiple machines. Silk MapReduce is based on Hadoop and can for instance be run on Amazon Elastic MapReduce. Silk MapReduce enables Silk to scale out to very big datasets by distributing the link generation to multiple machines.
- Silk Server can be used as an identity resolution component within applications that consume Linked Data from the Web. Silk Server provides an HTTP API for matching entities from an incoming stream of RDF data while keeping track of known entities. It can be used for instance together with a Linked Data crawler to populate a local duplicate-free cache with data from the Web.
All variants are based on the Silk Link Discovery Engine which offers the following features:
- Flexible, declarative language for specifying linkage rules
- Support of RDF link generation (owl:sameAs links as well as other types)
- Employment in distributed environments (by accessing local and remote SPARQL endpoints)
- Usable in situations where terms from different vocabularies are mixed and where no consistent RDFS or OWL schemata exist
- Scalability and high performance through efficient data handling (speedup factor of 20 compared to Silk 0.2):
- Reduction of network load by caching and reusing of SPARQL result sets
- Multi-threaded computation of the data item comparisons (3 million comparisons per minute on a Core2 Duo)
- Optional blocking of data items (see the sketch below)
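The blocking mentioned above (and the MultiBlock algorithm used by Silk Single Machine) rests on a simple idea: avoid comparing every source item with every target item by indexing items under a cheap key and scoring only the pairs that share a block. The sketch below shows that intuition with a deliberately coarse blocking key; MultiBlock itself is more sophisticated.

```python
# A rough sketch of blocking: prune the n*m comparison space before the
# (expensive) link condition is evaluated.
from collections import defaultdict

def block_key(item):
    # Deliberately coarse key: first letter of the label.
    return item["label"][:1].lower()

def candidate_pairs(source_items, target_items):
    """Yield only the pairs that share a block, instead of all n*m pairs."""
    blocks = defaultdict(list)
    for t in target_items:
        blocks[block_key(t)].append(t)
    for s in source_items:
        for t in blocks.get(block_key(s), []):
            yield s, t  # only these pairs reach the link condition

src = [{"label": "Berlin"}, {"label": "Madrid"}]
tgt = [{"label": "Berlin, Germany"}, {"label": "Boston"}, {"label": "Madras"}]
print(list(candidate_pairs(src, tgt)))
# Berlin is compared only with the two "b" entries, Madrid only with "Madras".
```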