
data mining

Original author: 
Todd Hoff

This is a guest post by Yelp's Jim Blomo. Jim manages a growing data mining team that uses Hadoop, mrjob, and oddjob to process TBs of data. Before Yelp, he built infrastructure for startups and Amazon. Check out his upcoming talk at OSCON 2013 on Building a Cloud Culture at Yelp.

In Q1 2013, Yelp had 102 million unique visitors (source: Google Analytics) including approximately 10 million unique mobile devices using the Yelp app on a monthly average basis. Yelpers have written more than 39 million rich, local reviews, making Yelp the leading local guide on everything from boutiques and mechanics to restaurants and dentists. With respect to data, one of the most distinctive things about Yelp is the variety: reviews, user profiles, business descriptions, menus, check-ins, food photos... the list goes on. We have many ways to deal with data, but today I’ll focus on how we handle offline data processing and analytics.

In late 2009, Yelp investigated using Amazon’s Elastic MapReduce (EMR) as an alternative to an in-house cluster built from spare computers.  By mid 2010, we had moved production processing completely to EMR and turned off our Hadoop cluster.  Today we run over 500 jobs a day, from integration tests to advertising metrics.  We’ve learned a few lessons along the way that can hopefully benefit you as well.

Job Flow Pooling

Original author: 
Florence Ion


What if you could privately use an application and manage its permissions to keep ill-intending apps from accessing your data? That’s exactly what Steve Kondik at CyanogenMod—the aftermarket, community-based firmware for Android devices—hopes to bring to the operating system. It’s called Incognito Mode, and it’s designed to help keep your personal data under control.

Kondik, a lead developer with the CyanogenMod team, published a post on his Google Plus profile last week about Incognito Mode. He offered more details on the feature:

I've added a per-application flag which is exposed via a simple API. This flag can be used by content providers to decide if they should return a full or limited dataset. In the implementation I'm working on, I am using the flag to provide these privacy features in the base system:

  • Return empty lists for contacts, calendar, browser history, and messages.
  • GPS will appear to always be disabled to the running application.
  • When an app is running incognito, a quick panel item is displayed in order to turn it off easily.
  • No fine-grained permissions controls as you saw in CM7. It's a single option available under application details.

The API provides a simple isIncognito() call which will tell you if incognito is enabled for the process (or the calling process). Third party applications can honor the feature using this API, or they can choose to display pictures of cats instead of running normally.
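As a rough illustration of the behavior described in the quote, here is a minimal Python model, not Android code. Only the name isIncognito() comes from Kondik's post; the AppProcess and ContactsProvider classes and the gps_enabled helper are hypothetical stand-ins invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class AppProcess:
    """Hypothetical stand-in for a running app; only the flag matters here."""
    name: str
    incognito: bool = False

    def is_incognito(self) -> bool:
        # Mirrors the isIncognito() call named in the post.
        return self.incognito


class ContactsProvider:
    """Hypothetical content provider that honors the per-application flag."""

    def __init__(self, contacts):
        self._contacts = list(contacts)

    def query(self, caller: AppProcess):
        # Incognito callers get an empty list instead of the full dataset.
        if caller.is_incognito():
            return []
        return list(self._contacts)


def gps_enabled(caller: AppProcess, system_gps_on: bool) -> bool:
    # GPS always appears disabled to an app running incognito.
    return system_gps_on and not caller.is_incognito()
```

The point of the design is that the app keeps running and simply sees less data, rather than being denied a permission and possibly failing.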

Every time you install a new application on Android, the operating system asks you to review the permissions the app requests before it can install. This approach to user data is certainly precarious because users can't deny individual permissions to pick and choose what an application has access to, even if they still want to use that app. Incognito Mode could potentially fix this conundrum, enabling users to restrict their data to certain applications.


Today's video is an interview with the Corporate Alliance Director and the Chief Technology Officer of the International Association of Privacy Professionals (IAPP), a non-profit organization that claims it is "...the largest and most comprehensive global information privacy community and resource, helping practitioners develop and advance their careers and organizations manage and protect their data." In other words, it's not the same as the much-beloved Electronic Privacy Information Center (EPIC), but is -- as its name implies -- a group of people engaged in privacy protection as part of their work or whose work is about privacy full-time, which seems to be the case for more and more IT and Web people lately, what with HIPAA and other privacy-oriented regulations. This is a growing field, well worth learning more about.


Read more of this story at Slashdot.


The advent of Salesforce Marketing Cloud and Adobe Marketing Cloud demonstrates the need for enterprises to develop new ways of harnessing the vast potential of big data. Yet these marketing clouds raise the question of who will help marketers, the frontline of businesses, maximize marketing spending and ROI and help their brands win in the end. Simply moving software from onsite to hosted servers does not change the capabilities marketers require; real competitive advantage stems from intelligent use of big data.

Marc Benioff, who is famous for declaring that “Software Is Dead,” may face a similar fate with his recent bets on Buddy Media and Radian6. These applications provide data to people who must then analyze, prioritize and act — often at a pace much slower than the digital world. Data, content and platform insights are too massive for mere mortals to handle without costing a fortune. Solutions that leverage big data are poised to win — freeing up people to do the strategy and content creation that is best done by humans, not machines.

Big data is too big for humans to work with, at least in the all-important analytical construct of responding to opportunities in real time: formulating efficient and timely responses to opportunities generated from your marketing cloud, or pursuing the never-ending quest for perfecting search engine optimization (SEO) and search engine marketing (SEM). The volume, velocity and veracity of raw, unstructured data are overwhelming. Big data pioneers such as Facebook and eBay have moved to massive Hadoop clusters to process their petabytes of information.

In recent years, we’ve gone from analyzing megabytes of data to working with gigabytes, and then terabytes, and then petabytes and exabytes, and beyond. Two years ago, James Rogers wrote in The Street: “It’s estimated that 1 Petabyte is equal to 20 million four-door filing cabinets full of text.” We’ve become jaded to seeing such figures. But 20 million filing cabinets? If those filing cabinets were a standard 15 inches wide, you could line them up, side by side, all the way from Seattle to New York and back again. One would need a lot of coffee to peruse so much information, one cabinet at a time. And a lot of marketing staff.
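The filing-cabinet arithmetic is easy to check. The short sketch below uses only the two figures from the paragraph (20 million cabinets, 15 inches each). The Seattle to New York distance of about 2,400 miles is an assumed straight-line approximation that does not come from the article.

```python
CABINETS = 20_000_000          # from the quoted estimate
WIDTH_INCHES = 15              # "standard" cabinet width given in the text
INCHES_PER_MILE = 12 * 5280    # 63,360 inches in a mile

# Assumption (not from the article): rough straight-line distance.
SEATTLE_NYC_MILES = 2400


def row_length_miles(cabinets=CABINETS, width_inches=WIDTH_INCHES):
    """Length of a side-by-side row of cabinets, in miles."""
    return cabinets * width_inches / INCHES_PER_MILE


if __name__ == "__main__":
    length = row_length_miles()
    print(f"Row length: {length:,.0f} miles")  # Row length: 4,735 miles
    print(f"One-way trips covered: {length / SEATTLE_NYC_MILES:.2f}")
```

Under that distance assumption the row covers a little under two one-way trips, so the "and back again" claim holds as a round figure.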

Of course, we have computers that do the perusing for us, but as big data gets bigger, and as analysts, marketers and others seek to do more with the massive intelligence that can be pulled from big data, we risk running into a human bottleneck. Just how much can one person — or a cubicle farm of persons — accomplish in a timely manner from the dashboard of their marketing cloud? While marketing clouds do a fine job of gathering data, it still comes down to expecting analysts and marketers to interpret and act on it — often with data that has gone out of date by the time they work with it.

Hence the need for big data solutions that leverage machine learning, language models and prediction: self-learning systems that go from using algorithms to harvest information from big data to using algorithms to initiate actions based on that data.

Yes, this may sound a bit frightful: Removing the human from the loop. Marketers indeed need to automate some decision-making. But the human touch will still be there, doing what only people can do — creating great content that evokes emotions from consumers — and then monitoring and fine-tuning the overall performance of a system designed to take actions on the basis of big data.

This isn’t a radical idea. Programmed trading algorithms already drive significant activity across stock markets. And, of course, Amazon, eBay and Facebook have become generators of — and consummate users of — big data. Others are jumping on the bandwagon as well. RocketFuel uses big data about consumers, sites, ads and prior ad performance to optimize display advertising. Turn.com uses big data from consumer Web behavior, on-site behaviors and publisher content to create, optimize and buy advertising across the Web for display advertisers.

The big data revolution is just beginning as it moves beyond analytics. If we were building CRM again, we wouldn’t just track sales-force productivity; we’d recommend how you’re doing versus your competitors based on data across the industry. If we were building marketing automation software, we wouldn’t just capture and nurture leads generated by our clients, we’d find and attract more leads for them from across the Web. If we were building a financial application, it wouldn’t just track the financials of your company, it would compare them to public filings in your category so you could benchmark yourself and act on best practices.

Benioff is correct that there’s an undeniable trend that most marketing budgets today are betting more on social and mobile. The ability to manage social, mobile and Web analysis for better marketing has quickly become a real focus, and a big data marketing cloud is needed to do it. However, the real value and ROI come from the ability to turn big data analysis into action, automatically. There’s clearly big value in big data, but it’s not cost-effective for any company to interpret and act on it before the trend changes or is over. Some reports find that 70 percent of marketers are concerned with making sense of the data and more than 91 percent are concerned with extracting marketing ROI from it. Incorporating big data technologies that create action means that your organization’s marketing can get smarter even while you sleep.

Raj De Datta founded BloomReach with 10 years of enterprise and entrepreneurial experience behind him. Most recently, he was an Entrepreneur-In-Residence at Mohr-Davidow Ventures. Previously, he was a Director of Product Marketing at Cisco. Raj also worked in technology investment banking at Lazard Freres. He holds a BSE in Electrical Engineering from Princeton and an MBA from Harvard Business School.


theodp writes "Mother Jones reports on Obama's Digital Gurus, the top-secret team of analytics engineers and scientists led by hipster CTO Harper Reed who work on text analytics, social network/media analysis, web personalization, computational advertising, and online experiments & testing from the campaign's Chicago HQ and satellite offices. For OFA (Obama for America), writes Tim Murphy, there is no such thing as Too Much Information. 'In terms of just the sheer amount of data that political candidates have on you,' says UNC Prof Daniel Kreiss, 'I think everyone finds it creepy.' Still playing catch-up to OFA in its data efforts is Team Romney, which reportedly hired former employees from places like Google Analytics, Apple, Omniture, and Overstock.com in an attempt to reverse engineer the Obama campaign's strategy."



Read more of this story at Slashdot.



The Economics of Interaction: How We Can Use Microeconomics to Describe System Interaction

The Economics of Interaction: How We Can Use Microeconomics to Describe the Interaction Between User and System
A Google Tech Talk, June 7, 2012
Presented by Leif Azzopardi

ABSTRACT

Searching is inherently an interactive process, usually requiring numerous iterations of querying and assessing in order to find the desired amount of relevant information. Essentially, the search process can be viewed as a combination of inputs (queries and assessments) which are used to "produce" output (relevance). Under this view, it is possible to adapt microeconomic theory to analyze and understand the dynamics of Interactive Information Retrieval. In this talk, I will describe how the search process can be treated as an economics problem and then go on to describe a series of simulations on TREC test collections where I analyzed various combinations of inputs in the "production" of relevance. The analysis reveals that the total Cumulative Gain obtained during the course of a search session is functionally related to querying and assessing. Furthermore, this relationship can be characterized mathematically by the Cobb-Douglas production function. Then, in a subsequent analysis using cost models, I show which search strategies minimize the cost of interaction for a given level of output. These developments establish the theoretical foundations of Interactive Information Retrieval, providing numerous directions for empirical experimentation that are motivated directly from theory ...
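The abstract does not give the fitted form or its parameters. As a rough sketch only, the code below assumes the textbook Cobb-Douglas form, gain = k * Q^alpha * A^beta, with Q the number of queries and A the number of assessments, plus a simple linear cost model. The function names, parameter values and costs are all made up for illustration and are not taken from the talk.

```python
def cumulative_gain(queries, assessments, k=1.0, alpha=0.5, beta=0.5):
    """Textbook Cobb-Douglas form: gain = k * Q**alpha * A**beta.

    k, alpha and beta are illustrative defaults, not fitted values.
    """
    return k * queries ** alpha * assessments ** beta


def cheapest_strategy(target_gain, query_cost, assess_cost, max_inputs=50):
    """Brute-force the (cost, queries, assessments) mix that reaches
    target_gain at the lowest linear cost. Returns None if none does."""
    best = None
    for q in range(1, max_inputs + 1):
        for a in range(1, max_inputs + 1):
            if cumulative_gain(q, a) >= target_gain:
                cost = q * query_cost + a * assess_cost
                if best is None or cost < best[0]:
                    best = (cost, q, a)
                break  # extra assessments at this q only add cost
    return best


if __name__ == "__main__":
    # Hypothetical costs: a query costs four times as much as an assessment.
    print(cheapest_strategy(target_gain=9.5, query_cost=4, assess_cost=1))
```

With these made-up costs, the cheapest mixes use few queries and many assessments. That is the kind of trade-off a cost model over a production function is meant to expose.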
From: GoogleTechTalks
Time: 49:55
More in: Science & Technology


The technology reporters and editors of The New York Times scour the Web for important and peculiar items. Tuesday's selection includes 30 of India's technology leaders, a photo-sharing iPhone app jumping to Android and a comic strip look at a possible future technological discovery.



The startups that presented at Y Combinator’s Demo Day last week were remarkable in their own right, but perhaps the most striking thing was the sheer number of them.

With 66 companies and 180 founders in this season’s batch, the auditorium at Mountain View’s Computer History Museum was practically bursting with angel investors and reps from every notable venture firm. And that was just the latest class. Since 2005, Y Combinator has spawned more than a dozen batches of startups, including Dropbox and Airbnb. The last two classes alone have created more than 120 companies.

So it raises the question of how Y Combinator has been able to grow in size while sustaining both the quality of startups it churns out and the value it provides for founders.

Essentially, how do you scale a company that creates companies?

The Strategy and Vision

“Our whole approach to scaling Y Combinator is the standard approach to scaling software,” said Paul Graham, Y Combinator’s co-founder.

There are a couple of rules, he said. 1) You can’t predict in advance where the bottlenecks will be so you just keep going until you hit the next one and 2) You can always scale a lot more than you originally predicted. “When you scale things, they often turn into other stuff that you would have never imagined,” he said.

Graham doesn’t have an exact size in mind when accepting companies for a new class. The early-stage venture firm accepts as many companies as the team thinks are worthy. Nor does Graham know how large Y Combinator should ultimately be.

“Imagine if you had asked Mark Zuckerberg that question when Facebook had just two universities,” Graham said. “A lot of what drives us is curiosity about what happens when something like this gets bigger.”

Indeed, some of the other partners liken working at Y Combinator to building a university or a new type of institution that’s never been seen before.

“If you think of YC as a corporation or a company, it has these characteristics that every big company would love to have,” said Harj Taggar, an alum who later became a YC venture partner. “It’s a bunch of smart people working on projects that they love and have upside in. But they are all linked together and get the benefits of being a part of a larger group. YC is effectively inventing a new form of organization.”

Given the scale of Graham’s ambition (which shouldn’t be surprising since he tells founders to have “frighteningly ambitious” startup ideas), we walked through some of the many bottlenecks YC has faced through the years:

Applications:

Y Combinator’s increasing cachet has brought a ballooning number of applications. Last October, Graham said that the firm was seeing about one submission per minute on deadline day for the most recent class.

Every one of the firm’s venture partners used to read every application. Now they don’t. They might read one-third of the applications. It’s the alumni who make the first pass, depending on how much time they have. Some do none while others read as many as 100 applications or more.

“We went back over the years and saw that we had never accepted a company for an interview where the alumni were majority ‘No,’” Taggar said. “This weeds out really bad applications so we can focus on the borderline ones, which take more time.”
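The alumni first pass that Taggar describes amounts to a simple majority filter. The sketch below shows one way it could work, assuming each application carries a list of "yes"/"no" votes. The data layout and the function names are hypothetical, and the article does not describe YC's actual tooling.

```python
from collections import Counter


def majority_no(votes):
    """True when more than half of the alumni votes are 'no'."""
    counts = Counter(vote.strip().lower() for vote in votes)
    return counts["no"] > len(votes) / 2


def first_pass(applications):
    """Return the ids of applications that survive the alumni screen.

    `applications` maps an application id to its list of alumni votes.
    Applications that nobody has read yet (no votes) are kept.
    """
    return [app_id for app_id, votes in applications.items()
            if not majority_no(votes)]
```

Ties and unread applications pass through, which matches the stated goal of weeding out only the clearly bad applications and leaving the borderline ones for the partners.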

But just in case they miss a potentially good company, Y Combinator is starting to use data mining software. They’ve fed a program all of the old Y Combinator applications to find predictors of success and apply them to new submissions, creating a backstop in case they miss something.

“There are two kinds of mistakes: funding a bad startup or missing a good one. Our biggest fear is missing a good startup,” Graham said, adding that Dropbox’s co-founder Drew Houston was actually rejected the first time around. They’ve used the program to generate a top 10 list of factors predicting the probability of acceptance. “I don’t want to share it, but it was fascinating,” Graham said.

After they pick a cohort of companies to interview, they fly them in. They used to run a single-track interview process where every partner had to be present in the room. Last time, they ran two interview tracks, with half the partners in each of two rooms and each room seeing half the finalists. This time, they might run three tracks simultaneously.

Following the interview, the partners decide within five minutes whether to accept the company.

“We have to be very disciplined,” Taggar said. “By the end of the day, when you’ve done twenty-something interviews, you can barely remember what happened in the first one.”

Advising:

Y Combinator’s big initial bottleneck was that there was one Paul Graham, and he only had 24 hours in a day. So the company brought on additional venture partners like Gmail creator Paul Buchheit and alumni like Taggar, Posterous co-founder Garry Tan and Aaron Iba, who successfully sold AppJet to Google. Geoff Ralston, who was chief executive of Lala, the music startup that exited to Apple in 2009, is joining as a partner for this round. Plus there are part-time partners like Loopt co-founder Sam Altman and Justin.tv founders Emmett Shear and Justin Kan. They joined YC’s original partners Jessica Livingston, Trevor Blackwell and Robert Morris.

“It turns out that this is almost perfectly parallelizable,” Graham said. “I know from experience that one partner can deal with 20 startups and if we have 66 startups, we’re at more than 2X over capacity.”

All of the partners are available for office hours and there’s an internal scheduling tool that Y Combinator uses to gauge demand and urgency from founders. Ash Rust, who co-founded SendHub, had an HR issue once. He was able to get office hours within 30 minutes and the right documentation almost immediately after that.

“I know how hard it can be to get help as a founder if you’re not the belle of the ball,” Rust said. “But I’ve never experienced that here.”

If that still sounds a little impersonal for something as unpredictable and idiosyncratic as founding a startup, Buchheit points out that YC’s alumni network is now so large that the firm is starting to have world-class experts on running companies in many areas.

“As YC gets larger, it actually gets better,” Buchheit said, pointing to the firm’s 800 alumni. “Half the time, I’m sending founders to talk to different alumni. If you’re doing a video startup, then I know the person you really ought to talk to is Justin Kan.”

The firm taps this alumni network when it holds mini-conferences around issues like user acquisition or iOS development.

“There’s this real feeling of appreciation,” Buchheit says. “The founders are very grateful for the experience, so they have a real loyalty and want to help out other companies. There’s a little bit of a pay-it-forward model built into the network.”

Tan even built a private social networking tool for YC founders. Taggar says it’s useful for putting faces to names and that they’ll probably add a section for skills like the ability to code in Python and so on.

Y Combinator’s emerging network effects:

Not only are alumni helping with admissions and advising, they can serve as market-makers for new startups. Many of mobile payment startup Stripe’s customers are part of Y Combinator while Exec is now offering special corporate accounts to run errands for other startups.

“Y Combinator has a built-in economy,” Buchheit says. “We have this tremendous network and another YC company can be your first reference customer when others won’t take the risk.”

Then if one company isn’t quite a home run, its founders and employees will likely be able to find work at another Y Combinator startup. When Jeff and Dan Morin were considering next steps after working on event startup Anyvite for a few years, Graham paired them with another founder, Olga Vidisheva, from the most recent batch. Now they’ve rounded up funding from Greylock Capital, Andreessen Horowitz, SV Angel and Benchmark Capital to bring independent fashion boutiques online at Shoptiques.

The alumni also come back to Demo Day to angel invest in startups from later batches and companies like Parse, Carwoo and Dropbox have raised angel funding from other alums.

Demo Day and Investors:

Maybe the next big bottleneck is the most obvious one: helping investors wade through the dozens of startups it launches every half-year. The firm had to move Demo Day to The Computer History Museum because its offices no longer had space to fit the hundreds of investors. Y Combinator is also reaching the upper limit of how many startups can pitch in a single day.

Getting through 66 pitches is a slog. “I don’t think we could handle a Demo Week,” Buchheit joked.

Taggar says he’s thinking about how to make it more efficient for investors to set up meetings with the right startups following Demo Day. Right now, the partners just have a mental map of the investor landscape and try to route the right companies to the right investors.

The week after Demo Day is an especially intense one as entrepreneurs and investors try to lock down deals. It’s kind of a weird biannual version of mating season.

With all the investor interest, the founders clearly don’t see Demo Day as the issue.

In fact, Rust had something else on his mind: how to efficiently get food on speaker nights. “Seriously, the only scaling problem is the enormous dinner line,” he said.
