Skip navigation

database systems

warning: Creating default object from empty value in /var/www/vhosts/ on line 33.

This paper presents Polybase, a feature of SQL Server PDW V2 that allows users to manage and query data stored in a Hadoop
cluster using the standard SQL query language. Unlike other database systems that provide only a relational view over HDFSresident data through the use of an external table mechanism, Polybase employs a split query processing paradigm in which
SQL operators on HDFS-resident data are translated into MapReduce jobs by the PDW query optimizer and then executed on the Hadoop cluster. The paper describes the design and implementation of Polybase along with a thorough performance evaluation that explores the benefits of employing a split query processing paradigm for executing queries that involve both structured data in a relational DBMS and unstructured data in Hadoop. Our results demonstrate that while the use of a splitbased query execution paradigm can improve the performance of some queries by as much as 10X, one must employ a cost-based query optimizer that considers a broad set of factors when deciding whether or not it is advantageous to push a SQL operator to Hadoop. These factors include the selectivity factor of the predicate, the relative sizes of the two clusters, and whether or not their nodes are co-located. In addition, differences in the semantics of the Java and SQL languages must be carefully considered in order to avoid altering the expected results of a query.

Link to the paper

Your rating: None
Original author: 
Todd Hoff

Erasure codes are one of those seemingly magical mathematical creations that with the developments described in the paper XORing Elephants: Novel Erasure Codes for Big Data, are set to replace triple replication as the data storage protection mechanism of choice.

The result says Robin Harris (StorageMojo) in an excellent article, Facebook’s advanced erasure codes: "WebCos will be able to store massive amounts of data more efficiently than ever before. Bad news: so will anyone else."

Robin says with cheap disks triple replication made sense and was economical. With ever bigger BigData the overhead has become costly. But erasure codes have always suffered from unacceptably long time to repair times. This paper describes new Locally Repairable Codes (LRCs) that are efficiently repairable in disk I/O and bandwidth requirements:

These systems are now designed to survive the loss of up to four storage elements – disks, servers, nodes or even entire data centers – without losing any data. What is even more remarkable is that, as this paper demonstrates, these codes achieve this reliability with a capacity overhead of only 60%.

They examined a large Facebook analytics Hadoop cluster of 3000 nodes with about 45 PB of raw capacity. On average about 22 nodes a day fail, but some days failures could spike to more than 100.

LRC test results found several key results.

  • Disk I/O and network traffic were reduced by half compared to RS codes.
  • The LRC required 14% more storage than RS, information theoretically optimal for the obtained locality.
  • Repairs times were much lower thanks to the local repair codes.
  • Much greater reliability thanks to fast repairs.
  • Reduced network traffic makes them suitable for geographic distribution.
  • LRC test results found several key results.
  • Disk I/O and network traffic were reduced by half compared to RS codes.

I wonder if we'll see a change in NoSQL database systems as well? 

Related Articles

Your rating: None

What do you do when you have a lot of things to display to the user, far more than can possibly fit on the screen? Paginate, naturally.


There are plenty of other real world examples in this 2007 article, but I wouldn't bother. If you've seen one pagination scheme, you've seen them all. The state of art in pagination hasn't exactly changed much – or at all, really – in the last 5 years.

I can understand paginating when you have 10, 50, 100, maybe even a few hundred items. But once you have thousands of items to paginate, who the heck is visiting page 964 of 3810? What's the point of paginating so much information when there's a hard practical limit on how many items a human being can view and process in any reasonable amount of time?

Once you have thousands of items, you don't have a pagination problem. You have a search and filtering problem. Why are we presenting hundreds or thousands of items to the user? What does that achieve? In a perfect world, every search would result in a page with a single item: exactly the thing you were looking for.


But perhaps you don't know exactly what you're looking for: maybe you want a variety of viewpoints and resources, or to compare a number of similar items. Fair enough. I have a difficult time imagining any scenario where presenting a hundred or so items wouldn't meet that goal. Even so, the items would naturally be presented in some logical order so the most suitable items are near the top.

Once we've chosen a suitable order and a subset of relevant items … do we really need pagination at all? What if we did some kind of endless pagination scheme, where we loaded more items into the view dynamically as the user reaches the bottom? Like so:

It isn't just oddball disemvowelled companies, either. Twitter's timeline and Google's image search use a similar endless pagination approach. Either the page loads more items automatically when you scroll down to the bottom, or there's an explicit "show more results" button.

Pagination is also friction. Ever been on a forum where you wished like hell the other people responding to the thread had read all four pages of it before typing their response? Well, maybe some of them would have if the next page buttons weren't so impossibly small, or better yet, not there at all because pagination was automatic and seamless. We should be actively removing friction where we want users to do more of something.

I'm not necessarily proposing that all traditional pagination be replaced with endless pagination. But we, as software developers, should avoid mindlessly generating a list of thousands upon thousands of possible items and paginating it as a lazy one-size-fits-all solution. This puts all the burden on the user to make sense of the items. Remember, we invented computers to make the user's life easier, not more difficult.

Once you've done that, there's a balance to be struck, as Google's research tells us:

User testing has taught us that searchers much prefer the view-all, single-page version of content over a component page containing only a portion of the same information with arbitrary page breaks.

Interestingly, the cases when users didn’t prefer the view-all page were correlated with high latency (e.g., when the view-all page took a while to load, say, because it contained many images). This makes sense because we know users are less satisfied with slow results. So while a view-all page is commonly desired, as a webmaster it’s important to balance this preference with the page’s load time and overall user experience.

Traditional pagination is not particularly user friendly, but endless pagination isn't without its own faults and pitfalls, either:

  • The scroll bar, the user's moral compass of "how much more is there?" doesn't work in endless pagination because it is effectively infinite. You'll need an alternate method of providing that crucial feedback, perhaps as a simple percent loaded text docked at the bottom of the page.
  • Endless pagination should not break deep linking. Even without the concept of a "page", users should be able to clearly and obviously link to any specific item in the list.
  • Clicking the browser forward or back button should preserve the user's position in the endless scrolling stream, perhaps using pushState.
  • Pagination may be a bad user experience, but it's essential for web spiders. Don't neglect to accommodate web search engines with a traditional paging scheme, too, or perhaps a Sitemap.
  • Provide visible feedback when you're dynamically loading new items in the list, so the user can tell that new items are coming, and their browser isn't hung – and that they haven't reached the bottom yet.
  • Remember that the user won't be able to reach the footer (or the header) any more, because items keep appearing as they scroll down in the river of endless content. So either move to static headers and footers, or perhaps use the explicit "load more" button instead of loading new content automatically.

For further reading, there's some excellent Q&A on the topic of pagination at ux.stackexchange.

Above all else, you should strive to make pagination irrelevant because the user never has to look at more than a few items to find what they need. That's why I suspect Google hasn't done much with this technique in their core search result pages; if they aren't providing great results on page 1, it doesn't really matter what kind of pagination they use because they're not going to be in business much longer. Take that lesson to heart: you should be worried most of all about presenting a relevant list of items to the user in a sensible order. Once you've got that licked, then and only then should you think about your pagination scheme.

[advertisement] What's your next career move? Stack Overflow Careers has the best job listings from great companies, whether you're looking for opportunities at a startup or Fortune 500. You can search our job listings or create a profile and let employers find you.

Your rating: None

Edge of Reality is an independent veteran studio based in Austin, Texas. We are working on multiple projects including some with the Sims Studio of Electronic Arts.
The studio has shipped 13 games over 13 years including several hits. The chemistry of the studio is very positive. Everyone puts aside their egos, and works hard to make the best game possible. Come join our team!

Your rating: None

To meet the service levels demanded by applications, database backend needs to deliver high performance.

Here you will find performance benchmark results of CSQL for database operations which application does 80% of the time. This benchmark is based on wisconsin benchmark which contains real time queries and database operations.
Machine Configuration: These tests were performed on normal desktop machine, Dell loaded with 1GB RAM on Linux 2.6 Kernel
Wisconsin Benchmark schema having tables 'big1' and 'big2' with 10K records each and 'small' table with 1K records.
Tests are performed after caching all records in memory for leading database management system.

Interface used: JDBC

CSQL Vs Leading DBMS

From the above results, it is evident that CSQL is approximately 30 times faster than leading database with standard JDBC interface for real time database operations. This demonstrates CSQL's ability to meet the most demanding service levels which traditional disk based database systems cannot deliver.

Your rating: None