Original author: Todd Hoff

This is a guest post by Yelp's Jim Blomo. Jim manages a growing data mining team that uses Hadoop, mrjob, and oddjob to process TBs of data. Before Yelp, he built infrastructure for startups and Amazon. Check out his upcoming talk at OSCON 2013 on Building a Cloud Culture at Yelp.

In Q1 2013, Yelp had 102 million unique visitors (source: Google Analytics), including approximately 10 million unique mobile devices using the Yelp app on a monthly average basis. Yelpers have written more than 39 million rich, local reviews, making Yelp the leading local guide on everything from boutiques and mechanics to restaurants and dentists. One of the most distinctive things about Yelp is the variety of its data: reviews, user profiles, business descriptions, menus, check-ins, food photos... the list goes on. We have many ways to deal with that data, but today I’ll focus on how we handle offline data processing and analytics.

In late 2009, Yelp investigated using Amazon’s Elastic MapReduce (EMR) as an alternative to an in-house cluster built from spare computers.  By mid 2010, we had moved production processing completely to EMR and turned off our Hadoop cluster.  Today we run over 500 jobs a day, from integration tests to advertising metrics.  We’ve learned a few lessons along the way that can hopefully benefit you as well.
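
For readers who haven't used mrjob, the Python MapReduce framework mentioned above, here is a minimal sketch of what a job looks like. The word-count task and the file name are purely illustrative, not one of Yelp's production jobs; the same script runs locally or, with the -r emr flag, on Elastic MapReduce.

    # wordcount.py -- illustrative mrjob job (not a Yelp production job)
    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, _, line):
            # Emit (word, 1) for every word on the input line.
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # Sum the per-word counts produced by all mappers.
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()

Run it locally with "python wordcount.py input.txt", or on EMR with "python wordcount.py -r emr input.txt" (assuming your AWS credentials are configured in mrjob's config file).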

Job Flow Pooling

Original author: David Pescovitz


Carlo Zapponi created Bolides, a fantastic animated visualization of meteorites that have been seen hitting the Earth. The data source is the Nomenclature Committee of the Meteoritical Society's Meteorite Bulletin. "The word bolide comes from Greek βολίς bolis, which means missile. Astronomers tend to use bolide to identify an exceptionally bright fireball, particularly one that explodes." Bolides    



Update: In my eagerness to announce these workshops I made a scheduling error, incorrectly thinking the dates would be March 15+16 rather than 16+17. As a result I need to move one of the workshops to the weekend before, and since the Intro workshop should happen before the Advanced the new dates will be:

  • Saturday March 9: Introduction to Processing and Generative Art
  • Saturday March 16: Generative Art, Advanced Topics

Sorry for the confusion! On the plus side the Intro workshop might now be a smaller group which should make it nice and intimate.

I haven’t done any workshops in New York since November, so I have decided to offer my Intro and Advanced Generative Art workshops on consecutive Saturdays: the Introduction on March 9 and the Advanced workshop on March 16.

The venue will be my apartment in comfortable Park Slope, Brooklyn. As usual I have 8 spots available for each workshop; they do tend to reach capacity, so get in touch sooner rather than later. Reservation is by email, and your spot is confirmed once I receive payment via PayPal.

The workshops will be taught using the most recent Processing 2.0 beta version (2.0b8 as of this moment), and as usual I will be using my own Modelbuilder library as a toolkit for solving the tasks we will be working on. Familiarizing yourself with Processing 2.0 and Modelbuilder would be good preparation.

Make sure to download Modelbuilder-0019 and Control-P5 2.0.4, then run through the provided examples. Check OpenProcessing.org for more Modelbuilder examples.

Note about dataviz: I know there is a lot of interest in data visualization, and I do get asked about it frequently in workshops. I can’t promise to cover data in detail since it’s a pretty big topic.

If you’re specifically looking for data techniques I would recommend looking at the excellent workshops series taught by my friend Jer Thorp. He currently offers two such workshops, titled “Processing and Data Visualization” and “Archive, Text, & Character(s)”.



Image copyright kentoh

In a series of articles last year, executives from the ad-data firms BlueKai, eXelate and Rocket Fuel debated whether the future of online advertising lies with “More Data” or “Better Algorithms.” Omar Tawakol of BlueKai argues that more data wins because you can drive more effective marketing by layering additional data onto an audience. While we agree with this, we can’t help feeling like we’re being presented with a false choice.

Maybe we should think about a solution that involves smaller amounts of higher quality data instead of more data or better algorithms.

First, it’s important to understand what data is feeding the marketing ecosystem and how it’s getting there. Most third-party profiles consist of data points inferred from the content you consume, forms you fill out and stuff you engage with online. Some companies match data from offline databases with your online identity, and others link your activity across devices. Lots of energy is spent putting trackers on every single touchpoint. And yet the result isn’t very accurate — we like to make jokes around the office about whether one of our colleagues’ profiles says they’re a man or a woman that day. Truth be told, on most days BlueKai thinks they are both.

One way to increase the quality of data would be to change where we get it from.

Instead of scraping as many touchpoints as possible, we could go straight to the source: The individual. Imagine the power of data from across an individual’s entire digital experience — from search to social to purchase, across devices. This kind of data will make all aspects of online advertising more efficient: True attribution, retargeting-type performance for audience targeting, purchase data, customized experiences.

So maybe the solution to “More Data” vs. “Better Algorithms” isn’t incremental improvements to either, but rather to invite consumers to the conversation and capture a fundamentally better data set. Getting this new type of data to the market won’t be easy. Four main hurdles need to be cleared for the market to reach scale.

Control and Comfort

When consumers say they want “privacy,” they don’t normally desire the insular nature of total anonymity. Rather, they want control over what is shared and with whom. Any solution will need to give consumers complete, transparent control over their profiles. Comfort is gained when consumers become aware of the information that advertisers are interested in — in most cases, the data is extremely innocuous. A recent PwC survey found that 80 percent of people are willing to share “information if a company asks up front and clearly states use.”

Remuneration

Control and Comfort are both necessary, but people really want to share in the value created by their data. Smart businesses will offer things like access to content, free shipping, coupons, interest rate discounts or even loyalty points to incentivize consumers to transact using data. It’s not much of a stretch to think that consumers who feel fairly compensated will upload even more data into the marketing cloud.

Trust and Transparency

True transparency around what data is gathered and what happens to it engenders trust. Individuals should have the final say about which of their data is sold. Businesses will need to adopt best practices and tools that allow the individual to see and understand what is happening with their data. A simple dashboard with delete functionality should do, for a start.

Ease of Use

This will all be moot if we make it hard for consumers to participate. Whatever system we ask them to adopt needs to be dead simple to use, and offer enough benefits for them to take the time and effort to switch. Here we can apply one of my favorite principles from Ruby on Rails — convention over configuration. There is so much value in data collected directly from individuals that we can build a system whose convention is to protect even the least sensitive of data points and still respect privacy, without requiring the complexity needed for configuration.

The companies who engage individuals around how their data is used and collected will have an unfair advantage over those who don’t. Their advertising will be more relevant, they’ll be able to customize experiences and measure impact to a level of precision impossible via third-party data. To top it off, by being open and honest with their consumers about data, they’ll have impacted that intangible quality that every brand strives for: Authenticity.

In the bigger picture, the advertising industry faces an exciting opportunity. By treating people and their data with respect and involving them in the conversation around how their data is used, we help other industries gain access to data by helping individuals feel good about transacting with it. From healthcare to education to transportation, society stands to gain if people see data as an opportunity and not a threat.

Marc is the co-founder and CEO of Enliken, a startup focused on helping businesses and consumers transact with data. Currently, it offers tools for publishers and readers to exchange data for access to content. Prior to Enliken, Marc was the founding CEO of Spongecell, an interactive advertising platform that produced one of the first ad units to run on biddable media.


Faced with the need to generate ever-greater insight and end-user value, some of the world’s most innovative companies — Google, Facebook, Twitter, Adobe and American Express among them — have turned to graph technologies to tackle the complexity at the heart of their data.

To understand how graphs address data complexity, we need first to understand the nature of the complexity itself. In practical terms, data gets more complex as it gets bigger, more semi-structured, and more densely connected.

We all know about big data. The volume of net new data being created each year is growing exponentially — a trend that is set to continue for the foreseeable future. But increased volume isn’t the only force we have to contend with today: On top of this staggering growth in the volume of data, we are also seeing an increase in both the amount of semi-structure and the degree of connectedness present in that data.

Semi-Structure

Semi-structured data is messy data: data that doesn’t fit into a uniform, one-size-fits-all, rigid relational schema. It is characterized by the presence of sparse tables and lots of null checking logic — all of it necessary to produce a solution that is fast enough and flexible enough to deal with the vagaries of real world data.

Increased semi-structure, then, is another force with which we have to contend, besides increased data volume. As data volumes grow, we trade insight for uniformity; the more data we gather about a group of entities, the more that data is likely to be semi-structured.
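
To make the idea concrete, here is a small, purely hypothetical sketch in Python: three business records expose different attributes, and once they are forced into a single rigid schema, sparse columns and null checks follow.

    # Hypothetical records: each business exposes a different set of attributes.
    businesses = [
        {"name": "Cafe Luna", "cuisine": "italian", "delivery": True},
        {"name": "Bay Dental", "accepts_insurance": ["Aetna", "Cigna"]},
        {"name": "Corner Garage", "services": ["oil change", "brakes"], "towing": False},
    ]

    # Flattened into one table, most columns are null for most rows,
    # and every query has to guard against the missing values:
    for b in businesses:
        if b.get("cuisine"):
            print(b["name"], "serves", b["cuisine"], "food")
        if b.get("accepts_insurance"):
            print(b["name"], "accepts", ", ".join(b["accepts_insurance"]))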

Connectedness

But insight and end-user value do not simply result from ramping up volume and variation in our data. Many of the more important questions we want to ask of our data require us to understand how things are connected. Insight depends on us understanding the relationships between entities — and often, the quality of those relationships.

Here are some examples, taken from different domains, of the kinds of important questions we ask of our data:

  • Which friends and colleagues do we have in common?
  • What’s the quickest route between two stations on the metro?
  • What do you recommend I buy based on my previous purchases?
  • Which products, services and subscriptions do I have permission to access and modify? Conversely, given this particular subscription, who can modify or cancel it?
  • What’s the most efficient means of delivering a parcel from A to B?
  • Who has been fraudulently claiming benefits?
  • Who owns all the debt? Who is most at risk of poisoning the financial markets?

To answer each of these questions, we need to understand how the entities in our domain are connected. In other words, these are graph problems.

Why are these graph problems? Because graphs are the best abstraction we have for modeling and querying connectedness. Moreover, the malleability of the graph structure makes it ideal for creating high-fidelity representations of a semi-structured domain. Traditionally relegated to the more obscure applications of computer science, graph data models are today proving to be a powerful way of modeling and interrogating a wide range of common use cases. Put simply, graphs are everywhere.
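
As a toy illustration of the second question above, the sketch below finds the route with the fewest stops between two stations of a made-up metro network using a breadth-first search; the station names and connections are invented for the example.

    # Illustrative only: a tiny metro network as an adjacency list.
    from collections import deque

    metro = {
        "Central": ["Park", "Riverside"],
        "Park": ["Central", "Museum"],
        "Riverside": ["Central", "Museum", "Airport"],
        "Museum": ["Park", "Riverside"],
        "Airport": ["Riverside"],
    }

    def shortest_route(graph, start, goal):
        # Breadth-first search returns the path with the fewest stops.
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for nxt in graph[path[-1]]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None

    print(shortest_route(metro, "Park", "Airport"))
    # -> ['Park', 'Central', 'Riverside', 'Airport']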

Graph Databases

Today, if you’ve got a graph data problem, you can tackle it using a graph database — an online transactional system that allows you to store, manage and query your data in the form of a graph. A graph database enables you to represent any kind of data in a highly accessible, elegant way using nodes and relationships, both of which may host properties:

  • Nodes are containers for properties, which are key-value pairs that capture an entity’s attributes. In a graph model of a domain, nodes tend to be used to represent the things in the domain. The connections between these things are expressed using relationships.
  • A relationship has a name and a direction, which together lend semantic clarity and context to the nodes connected by the relationship. Like nodes, relationships can also contain properties: Attaching one or more properties to a relationship allows us to weight that relationship, or describe its quality, or otherwise qualify its applicability for a particular query.

The key thing about such a model is that it makes relations first-class citizens of the data, rather than treating them as metadata. As real data points, they can be queried and understood in their variety, weight and quality: Important capabilities in a world of increasing connectedness.
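
The sketch below illustrates that model with plain Python structures rather than any particular database's API: nodes and relationships both carry properties, relationships are named and directed, and a "friends in common" question becomes a simple traversal. All names and values are made up.

    # Illustrative property-graph model; not a real database API.
    nodes = {
        1: {"name": "Alice"},
        2: {"name": "Bob"},
        3: {"name": "Carol"},
    }

    # Each relationship has a type, a direction (from -> to) and its own properties.
    relationships = [
        {"from": 1, "to": 2, "type": "KNOWS", "props": {"since": 2009}},
        {"from": 1, "to": 3, "type": "KNOWS", "props": {"since": 2011}},
        {"from": 2, "to": 3, "type": "KNOWS", "props": {"since": 2012}},
    ]

    def knows(person_id):
        # Ids connected to person_id by a KNOWS relationship, in either direction.
        out = {r["to"] for r in relationships if r["from"] == person_id and r["type"] == "KNOWS"}
        inc = {r["from"] for r in relationships if r["to"] == person_id and r["type"] == "KNOWS"}
        return out | inc

    # "Which friends do Alice and Bob have in common?"
    common = (knows(1) & knows(2)) - {1, 2}
    print([nodes[i]["name"] for i in common])   # -> ['Carol']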

Graph Databases in Practice

Today, the most innovative organizations are leveraging graph databases to solve the challenges around their connected data. These include major names such as Google, Facebook, Twitter, Adobe and American Express. Graph databases are also being used by organizations in a range of fields, including finance, education, the web, ISVs, telecom and data communications.

The following examples offer use case scenarios of graph databases in practice.

  • Adobe Systems currently leverages a graph database to provide social capabilities for its Creative Cloud — a new array of services for media enthusiasts and professionals. A graph offers clear advantages in capturing Adobe’s rich data model fully, while still allowing for high-performance queries that range from simple reads to advanced analytics. It also enables Adobe to store large amounts of connected data across three continents, all while maintaining high query performance.
  • Europe’s No. 1 professional network, Viadeo, has integrated a graph database to store all of its users and relationships. Viadeo currently has 40 million professionals in its network and requires a solution that is easy to use and capable of handling major expansion. After integrating a graph model, Viadeo accelerated its system performance by more than 200 percent.
  • Telenor Group, one of the world’s top ten wireless telecom companies, uses a graph database to manage its customer organizational structures. The ability to model and query complex data such as customer and account structures with high performance has proven critical to Telenor’s ongoing success.

An access control graph. Telenor uses a similar data model to manage products and subscriptions.

  • Deutsche Telekom leverages a graph database for its highly scalable social soccer-fan website, which attracts tens of thousands of visitors during each match. The graph database provides painless data modeling, seamless extensibility of the data model, and high performance and reliability.
  • Squidoo is the popular social publishing platform where users share their passions. It recently created a product called Postcards: single-page, beautifully designed recommendations of books, movies, music albums, quotes and other products and media types. A graph database underpins the user experience, serving as the primary data store for the Postcards taxonomy and as the recommendation engine for what users should do next.

Such examples illustrate how pervasive connections are within data, and the power of a graph model to map those relationships. A graph database allows you to further query and analyze such connections, providing greater insight and end-user value. In short, graphs are poised to deliver true competitive advantage by offering deeper perspective into data as well as a new framework to power today’s revolutionary applications.

A New Way of Thinking

Graphs are a new way of thinking for explicitly modeling the factors that make today’s big data so complex: Semi-structure and connectedness. As more and more organizations recognize the value of modeling data with a graph, they are turning to the use of graph databases to extend this powerful modeling capability to the storage and querying of complex, densely connected structures. The result is the opening up of new opportunities for generating critical insight and end-user value, which can make all the difference in keeping up with today’s competitive business environment.

Emil is the founder of the Neo4j open source graph database project, which is the most widely deployed graph database in the world. As a life-long compulsive programmer who started his first free software project in 1994, Emil has with horror witnessed his recent degradation into a VC-backed powerpoint engineer. As the CEO of Neo4j’s commercial sponsor Neo Technology, Emil is now mainly focused on spreading the word about the powers of graphs and preaching the demise of tabular solutions everywhere. Emil presents regularly at conferences such as JAOO, JavaOne, QCon and OSCON.


Image via vichie81

Recently, Omar Tawakol from BlueKai wrote a fascinating article arguing that more data trumps a better algorithm, and that better still is an algorithm that augments your data with linkages and connections, in the end creating a more robust data asset.

At Rocket Fuel, we’re big believers in the power of algorithms. This is because data, no matter how rich or augmented, is still a mostly static representation of customer interest and intent. The traditional way to use data in Web advertising, choosing whom to show ads based on the specific data segments they fall into, is itself one very simple choice of algorithm. But there are many others that can be strategically applied to take advantage of specific opportunities in the market, like a sudden burst of relevant ad inventory or a sudden increase in competition for consumers in a particular data segment. Algorithms can react to the changing usefulness of data, such as data that indicates interest in a specific time-sensitive event that has now passed. They can also take advantage of ephemeral data not tied to individual behavior in any long-term way, such as the time of day or the context in which the person is browsing.

So while the world of data is rich, and algorithms can extend those data assets even further, the use of that data can be even more interesting and challenging, requiring extremely clever algorithms that result in significant, measurable improvements in campaign performance. Very few of these performance improvements are attributable solely to the use of more data.

For the sake of illustration, imagine you want to marry someone who will help you produce tall, healthy children. You are sequentially presented with suitors whom you have to either marry, or reject forever. Let’s say you start with only being able to look at the suitor’s height, and your simple algorithm is to “marry the first person who is over six feet tall.” How can we improve on these results? Using the “more data” strategy, we could also look at how strong they are, and set a threshold for that. Alternatively, we could use the same data but improve the algorithm: “Measure the height of the first third of the people I see, and marry the next person who is taller than all of them.” This algorithm improvement has a good chance of delivering a better result than just using more data with a simple algorithm.
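
For the curious, here is a small Monte Carlo sketch of the two strategies so you can compare them empirically; the height distribution, thresholds and pool size are assumptions made purely for illustration.

    # Illustrative simulation only: distribution and parameters are assumed.
    import random

    def suitors(n=100):
        # Heights in inches, drawn from an assumed normal distribution.
        return [random.gauss(69, 3) for _ in range(n)]

    def simple_threshold(heights, threshold=72):
        # "Marry the first person who is over six feet tall."
        for h in heights:
            if h > threshold:
                return h
        return heights[-1]  # settle for the last suitor if no one qualifies

    def observe_then_pick(heights, observe_frac=1/3):
        # "Measure the first third, then marry the next person taller than all of them."
        cutoff = int(len(heights) * observe_frac)
        best_seen = max(heights[:cutoff])
        for h in heights[cutoff:]:
            if h > best_seen:
                return h
        return heights[-1]

    trials = [suitors() for _ in range(10000)]
    print("simple threshold :", sum(simple_threshold(t) for t in trials) / len(trials))
    print("observe then pick:", sum(observe_then_pick(t) for t in trials) / len(trials))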

Choosing opportunities to show online advertising to consumers is very much like that example, except that we’re picking millions of “suitors” each day for each advertiser, out of tens of billions of opportunities. As with the marriage challenge, we find it is most valuable to make improvements to the algorithms to help us make real-time decisions that grow increasingly optimal with each campaign.

There’s yet another dimension not covered in Omar’s article: the speed of the algorithms and data access, and the capacity of the infrastructure on which they run. The provider you work with needs to be able to make more decisions, faster, than any other players in this space. Doing that calls for a huge investment in hardware and software improvements at all layers of the stack. These investments are in some ways orthogonal to Omar’s original question: they simultaneously help optimize the performance of the algorithms, and they ensure the ability to store and process massive amounts of data.

In short, if I were told I had to either give up all the third-party data I might use, or give up my use of algorithms, I would give up the data in a heartbeat. There is plenty of relevant data captured through the passive activity of consumers interacting with Web advertising — more than enough to drive great performance for the vast majority of clients.

Mark Torrance is CTO of Rocket Fuel, which provides artificial-intelligence advertising solutions.


The inside of Equinix's co-location facility in San Jose—the home of CloudFlare's primary data center.

Photo: Peter McCollough/Wired.com

On August 22, CloudFlare, a content delivery network, turned on a brand new data center in Seoul, Korea—the last of ten new facilities started across four continents in a span of thirty days. The Seoul data center brought CloudFlare's number of data centers up to 23, nearly doubling the company's global reach—a significant feat in itself for a company of just 32 employees.

But there was something else relatively significant about the Seoul data center and the other nine facilities set up this summer: despite the fact that the company owned every router and every server in its racks, and each had been configured with great care to handle the demands of CloudFlare's CDN and security services, no one from CloudFlare had ever set foot in them. All that came from CloudFlare directly was a six-page manual instructing facility managers and local suppliers on how to rack and plug in the boxes shipped to them.

"We have nobody stationed in Stockholm or Seoul or Sydney, or a lot of the places that we put these new data centers," CloudFlare CEO Matthew Prince told Ars. "In fact, no CloudFlare employees have stepped foot in half of the facilities where we've launched." The totally remote-controlled data center approach used by the company is one of the reasons that CloudFlare can afford to provide its services for free to most of its customers—and still make a 75 percent profit margin.



At Google's Zeitgeist conference, its chairman, Eric E. Schmidt, described a long-term future in which life is managed by robots, and a nearer-term one in which billions more people can get access to information through new devices and connectivity.


The centuries-old scientific and engineering idea of progress through observing, modeling, testing and modifying is under attack. Now, the argument goes, it is better to collect and examine lots of data, look for patterns, and follow up on the most promising ones. The latest example: Autodesk says designers should generate a thousand product versions.
