Jim Gray on Building Petabyte Databases with SQL Server

We are proud to have Jim Gray of the Microsoft Research Group deliver the VSLive! San Francisco keynote, "Build Petabyte Databases with SQL Server."

Jim Gray was a key player in building the TerraServer, a 12-terabyte spatial data warehouse using Microsoft SQL Server 2000 as its building blocks. The TerraServer Web site averages 7 million hits per day. That's the sort of project Jim Gray likes. He founded and runs Microsoft's Bay Area Research Center (BARC), a research lab that focuses on scalable computing by building superservers and workgroup systems from commodity software and hardware. These modular systems also have built-in fault tolerance, allowing continuous operation in case of failure. As a sneak preview of this session, check out the interview with Jim below!




How will the upcoming integration of .NET into SQL Server affect performance, use of SQL Server as a data source for Web services, and so on? And how will it work? Will SQL Server essentially become a .NET host? Or just provide hooks into .NET Framework namespaces and classes?
The integration of SQL Server with .NET is a major advance. It's the next step after OLE DB: integrating objects with databases. SQL Server becomes the data integrator, the way to deal with sets of objects. We have been chasing the Object-Relational rainbow and we are near the pot of gold—an extensible database. Now you will be able to store objects in the database, and you will be able to treat databases as objects. Performance should be even better with compiled languages than with interpreted T-SQL. And now stored procedures will be first-class objects.
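The compiled-procedure integration Gray describes here eventually shipped as SQL Server's CLR integration (in SQL Server 2005). As a rough sketch of the wiring (the assembly, class, and procedure names below are hypothetical), a routine compiled from C# or another .NET language is registered so that callers invoke it exactly like a T-SQL stored procedure:

    -- Load a compiled .NET assembly into the database (hypothetical path and names);
    -- this assumes CLR integration is enabled on the server.
    CREATE ASSEMBLY OrderRules
    FROM 'C:\builds\OrderRules.dll'
    WITH PERMISSION_SET = SAFE;
    GO

    -- Expose one of its static methods as an ordinary stored procedure.
    CREATE PROCEDURE dbo.CheckOrder
        @OrderId int
    AS EXTERNAL NAME OrderRules.[OrderRules.Procedures].CheckOrder;
    GO

    -- From here on, the compiled routine is called like any other procedure.
    EXEC dbo.CheckOrder @OrderId = 42;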

SQL Server appears to be taking on more and more roles for any and all types of static data stores. What types of data stores can you foresee SQL Server replacing? The Registry? The file services? E-mail storage?
Yes. Oracle, DB2, and SQL Server are each on a path to store "files": photos, videos, voicemail, e-mail, documents, and everything else. The integration of SQL Server with an object system (.NET) makes this attractive. Microsoft has been shooting for this for a long time. To the database folks it seems a natural evolution. To others it seems unnatural. There is a competition happening—each camp proposing a way to solve the problems. No doubt the end result will be a synthesis of all these ideas. But I think we all agree that storage management has to be more automatic and easier. We all have a common goal. The only debate is how to get there.

How can modern DBMSs deal with the impedance mismatch between relational databases and nonrelational data (e.g., XML)? What innovations are on the horizon to make access to hierarchical and nonrelational data transparent? What's new in terms of performance and scalability with nonrelational data?
Much has been made of the impedance mismatch. SQL is set-oriented and Cobol-C-Java-C# are iterator-oriented (one record at a time, please). The languages have rich type systems, while databases have poor type systems. This is about to change with .NET. The database is a Web service. The type system is universal. You can add databases to types; you can add types to databases. There is no longer an inside and an outside—the database is both above and below the type system. XQuery extends SQL to operate on trees and graphs (unifies SQL and XPath). That story is not over yet, but XQuery is our best answer so far. We are still exploring how to handle XML data: Do we shred it all, do we BLOb it, or something in between? Between is the obvious answer…
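To make the shred-versus-BLOb choice concrete, here is a small sketch using SQL Server 2000's OPENXML; the order document and table names are made up. The same XML can be kept whole in a text column or shredded into a relational rowset on demand:

    -- Option 1: keep the document whole ("BLOb it") and shred it only when needed.
    CREATE TABLE OrderDocs (DocId int PRIMARY KEY, Doc ntext);

    -- Option 2: shred the XML into rows with OPENXML.
    DECLARE @doc nvarchar(1000), @hdoc int;
    SET @doc = N'<Orders>
                   <Order Id="1" Customer="ALFKI"/>
                   <Order Id="2" Customer="BONAP"/>
                 </Orders>';
    EXEC sp_xml_preparedocument @hdoc OUTPUT, @doc;

    SELECT Id, Customer
    FROM   OPENXML(@hdoc, '/Orders/Order', 1)   -- 1 = attribute-centric mapping
           WITH (Id int, Customer nvarchar(10));

    EXEC sp_xml_removedocument @hdoc;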

What are your predictions about the future of SQL Server Analysis Services in the business world, and the fate of OLAP, data mining, and business intelligence in general? Is Microsoft doing enough to evangelize the usefulness of these technologies in industry? What prevents these technologies from becoming more mainstream and commonplace? Will Microsoft's new Data Analyzer product help here?
Data Analyzer will definitely help Analysis Services in terms of both OLAP and data mining. One aspect of SQL Server is that everyone gets everything. There is no separate charge for replication or for OLAP or for data mining or English Query. So it is not so obvious from the outside how widely these features are used. I am told that each of these features is very widely used. Certainly, when I go to the users-group meetings and when I lurk in the discussion groups, there is a huge amount of interest and innovation in these areas. The basic evolution seems to be that people build an operational system using SQL Server, and then they build some ancillary decision-support systems using that data. Most of those ancillary systems are built with Analysis Services. This might not seem mainstream, but the integrated system and tools make it easy to build these analysis systems. This is a reasonable way to operate.

Do you see Microsoft technologies gaining greater acceptance among academic computer scientists? Do you think the ubiquity of Unix in the academic world is a hindrance to Microsoft's success in the business world?
All my academic friends have a Windows system for their PowerPoint presentations. But they often have a Solaris or Linux box where they do their programming. Certainly it would be better for Microsoft if everyone used only our stuff, but that will never happen. In the SkyServer project the heterogeneity costs us a lot of energy. We have to support Netscape 4-6 and Opera. All the client tools have to work on Linux. That makes it much more complex to deploy apps. But that's the reality. As for Microsoft making progress in academia, that is certainly happening, and it is mostly being driven by our tools people. Academics are working with languages like Eiffel to make them compatible with our .NET technologies so that Visual Studio .NET is a great environment for programming with any language. The benefit to the academics and their students is that they get a robust, highly usable environment: a debugger, integrated help, rich editing capabilities, and so on.

What about the future of TerraServer? Will Microsoft continue to expand Earth coverage? Use it to break out terabytes of data into petabytes? (Petabytes isn't even in Microsoft Word's dictionary.)
The TerraServer is five years old now. It has become part of the HomeAdvisor site, and the HomeAdvisor folks have done a very nice job of integrating it with demographic data about your neighborhood. It continues to be one of the 1,000 most popular Web sites—quite amazing. It is also a poster child for large SQL Server databases (12 TB) and failover clustering (a 4x8 Compaq Windows Data Center cluster). From a .NET perspective, it has also morphed into a Web service. That service is a great prototype for anyone interested in building an image-based Web service. Very cool if you have not visited it.

What level of detail will the SkyServer contain? With that kind of data in a single store, what kind of relationships do you expect it to enable astronomers to discover? Or is that the focus of the project? What is the source of the data, and what kinds of conversions are you doing? Will Analysis Services play a role?
SkyServer is like TerraServer but looking the other way. It has a 0.25 arcsecond picture of the northern sky (incomplete right now, but that is where we are heading). In addition to the pixel data, it will have about 200 million galaxies and about a million spectra. It is a data mining challenge. So far we have been using pure SQL, but we are experimenting with cubes and data mining. The project focus is to understand the structure of the early universe, compute the cosmological constant to an extra decimal place, and perhaps find out about dark stuff (the other 95 percent of the stuff in the universe). Now that we have the SkyServer working, we are converting it to a Web service, and working to federate it with the other astronomy archives to turn the Internet into the World Wide Telescope. That's where I am putting most of my energies at this point. It is a very good test of the Web services concepts.
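To give a flavor of the "pure SQL" approach Gray mentions, a typical set-oriented cut over a photometric catalog might look like the query below. The table, columns, and cut values are illustrative, not the actual SkyServer schema.

    -- Select unusually blue point sources (quasar candidates) from a photometry table.
    -- Table, column names, and cut values are hypothetical.
    SELECT objID, ra, dec, u, g, r
    FROM   PhotoObj
    WHERE  type = 6          -- point source (star-like)
      AND  u - g < 0.4       -- very blue in u minus g
      AND  g - r < 0.2;      -- and in g minus r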

What do you see for these kinds of databases in the next decade? What will we be doing with data then?
When data volumes get large, you can't grep them any more, you have to use an index. That's when you need some kind of database. My group has been trying to figure out how much you should put in the database—everything or just the indices. As I get into it, I find it is a lot easier to put everything in the database: photos, videos, images, and of course, text. Sometimes it is just BLOb data, but often you want to "parse" the data and extract as much meta information as possible. For example, we built a database that has all our personal photos. The classic design would have these photos in a file system and just the metadata in the database. But it is so much easier to manage when everything is in the database. There is only one thing to back up and to secure; everything is just simpler. There are fewer design decisions to make. My prediction is that eventually all storage systems will evolve to be database systems.
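A minimal sketch of the all-in-the-database photo design he describes: one table holds both the extracted metadata and the pixels themselves as a BLOB (the column names are illustrative).

    -- Everything about a photo lives in one table: metadata columns plus the image BLOB.
    CREATE TABLE Photos (
        PhotoId      int IDENTITY PRIMARY KEY,
        TakenAt      datetime,        -- extracted ("parsed") metadata ...
        CameraModel  nvarchar(64),
        WidthPixels  int,
        HeightPixels int,
        Caption      nvarchar(256),
        Pixels       image            -- ... and the JPEG itself, stored in the database
    );

Backing up or securing the whole collection is then just backing up or securing the one database.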

Aficionados of database platforms such as Oracle and DB2 continue to claim that SQL Server, while cost effective, can't match the performance and reliability of these other platforms running on Unix, Linux, and mainframes. Given your expertise in the database performance and VLDB (very large database) arenas, and your position at Microsoft, I'm assuming you beg to differ. What would your rebuttal be to the Microsoft database doubters out there?
SQL Server's strong points are ease of use and integration with Windows. I think the rank order of things for any database is quality, reliability, manageability/usability, functionality, and performance. Let's take them in turn.

Quality: We have been measuring the quality of SQL Server, Oracle, DB2, Sybase, and Informix since 1996 (Don Slutz's RAGS work). Candidly, back in 1996, none of the products had good quality—they often crashed on complex queries, and they often gave wrong answers. This was a really dirty secret. But they have all improved a lot in the last five years. Now DB2, Oracle, and SQL Server all give the right answer—and the same answer—on almost all queries (one product refuses to believe the null string and the empty string are different and another thinks varchars are at most 255 characters long, but I'm ignoring things like that). So, I think that all the main SQL products now have reasonable quality.
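The null-string quirk mentioned in passing is easy to demonstrate. On SQL Server (and DB2) the probe below reports that the empty string is distinct from NULL; the product Gray alludes to treats the two as the same, so the equivalent query there gives the other answer (the exact SELECT syntax also differs by product).

    -- Does this engine treat the empty string as NULL?
    SELECT CASE WHEN '' IS NULL
                THEN 'empty string treated as NULL'
                ELSE 'empty string distinct from NULL'
           END AS verdict;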

Reliability: How well do the systems avoid losing data? Again, there has been a lot of progress in all the products, but SQL Server had the longest way to come in this space. The Sybase legacy was not good in this area. There were metadata operations not covered by locks, backup killed online performance, and you had to run DBCC a lot. SQL Server 7.0 fixed most of those problems in 1999 and SQL Server 2000 completed that overhaul. At this point, many companies depend on SQL Server and operate it without any fear of lost data. I have been operating TerraServer (terraservice.net) and SkyServer (skyserver.sdss.org) for several years, and the systems have never had a SQL Server failure. That is typical. TerraServer is in fact a 4-node cluster, so we have really good availability. Tom Barclay demos TerraServer node failovers on the live system, which is a testimony to how much we trust the software. So, I think SQL Server reliability is rock solid.

Manageability/Usability: All you need to do is install all three products to see which one is easiest to manage and use. SQL Server is integrated with itself and with the system. All the other products seem to be buy-outs and components that have been "married," and they are "guests" on the host operating systems. This is the upside of being a Windows-only system. The SQL Server tools, the online documentation, and the wizards are all industry-leading. The other vendors are trying to achieve this seamless design, but it is really hard, given the fact that their components were not designed to work together.

Functionality: My sense is that all the products have more functionality than any one person can use. Sure, we all wish for more, and indeed the next release will have more. I for one am still trying to master the many tools that are in the current release. I am using SQL Server almost every day in applications, and I confess that there are few functions I lust after. My main desire is for the integration of C# with SQL Server (to replace T-SQL with a modern, object-oriented language with real datatypes), and for multinode tables.

Performance: SQL Server performance is very good. I have been using it for Web servers and for DSS. We are reading data at 5 million records per second and 350 MB per second on a dual processor system (!). The SQL Server optimizer picks good plans, the execution engine works well, and the whole thing is self-managing. The OLAP engine is astonishingly easy to use and is also very quick.

You can look at the TPC, SAP, and other benchmarks to see that SQL Server has great performance. The TPC-C benchmark shows that SQL Server has 50 percent better performance and two times better price performance than any Unix system. The only close competitor is DB2 on Windows—but SQL Server has the best performance and best price performance of any system on the TPC-C benchmark. In the largest benchmark it is managing a 53-terabyte database and processing over a billion TPC-C transactions per day. I think the performance issue is settled. When you have 32-processor, 32-bit Windows systems running SQL Server and beating 64-processor, 64-bit Solaris systems, I think that really says a lot. And we are moving forward rapidly on the Windows side; IBM and others are at work on Itanium (Intel 64-bit) systems that will contain 64 processors or more. For customers who need more power than what a 32-way system can provide or who want to take advantage of many smaller boxes, we have a scale-out solution on SQL Server 2000 that works today.


Jim Gray
Jim Gray is part of Microsoft's research group. His work focuses on databases and transaction processing. Jim is active in the research community, is an ACM, NAE, NAS, and AAAS Fellow, and received the ACM Turing Award for his work on transaction processing. He is also a member of the PITAC and edits a series of books on data management.