From New Wiki
This section contains synthetic benchmarks.
The dataset we refer to as YahooBenchmark is a collection of simple micro-benchmarks designed in the context of the Yahoo! Cloud Serving Benchmark recently published in SOCC 2010. The goal of the benchmark is to represent data management applications that are simple in nature, but that require high scalability. This benchmark is designed to evaluate internal data management mechanisms for large organizations such as Yahoo, Google, Facebook, and create a simple benchmarking platform for cloud-based data management solutions. In our context, this benchmark is used to provide insight in some of the capabilities of our tool, e.g., its ability fall back to cheaper partitioning strategies in case of a tie, on a standardized and simple test case.
From the five Yahoo! Cloud Serving Benchmark workloads we picked workload A and workload E. Workload A consists of a 50/50 mix of reads/writes on a single tuple chosen from a Zipfian distribution. Despite the potential challenges to achieve performance on Internet-scale data, this is a very easy problem to partition. In fact, with the exception of full-replication, every partitioning strategy leads to zero distribution cost (since transactions touch only one tuple). Hence, the goal of running this test is to show how our system is capable of selecting a cheaper partitioning strategy (e.g., hash partitioning) when one exists.
Workload E consists of a 95-5 mix of read and writes, where the reads perform a short scan (of length uniformly chosen from 1--100), and writes touch a single record. The starting point of the read scan and the write are chosen at random with Zipfian distribution. The goal of using this dataset is showing that hashing fails for range queries, and that our tool can automatically chose range-predicate split points that produce a close-to-optimal partitioning strategy.
In our experimental section we used a benchmark derived from the TPC-C specification . The benchmark we used is an implementation of original TPC-C specification which not take into account some of the ``industrial requirements imposed by the Transaction Processing Performance Council, such as pricing, minimum running time, and performance reporting constraints.
We report a brief summary of the benchmark extracted from the TPC-C specification:
``TPC-C simulates a complete computing environment where a population of users executes transactions against a database. The benchmark is centered around the principal activities (transactions) of an order-entry environment. These transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses. While the benchmark portrays the activity of a wholesale supplier, TPC-C is not limited to the activity of any particular business segment, but, rather represents any industry that must manage, sell, or distribute a product or service.
The TPC-C schema contains 9 tables with a total of 92 columns, 8 primary keys,and 9 foreign keys. TPC-C workload is based on 5 different types of transactions.
Some initial experiments and analysis of the TPC-C resource utilization are available here Experiments.
The TPC-E specification is available at [www.tpc.org/tpce/]. The benchmark we use is simplification of the original TPC-E benchmark that maintains the spirit and complexity of the original specification. Amongst other differences, we do not take into account some of the ``industrial requirements imposed by the Transaction Processing Performance Council, such as pricing, minimum running time, constraints in performance reporting, and we simplify some of the SQL transactions. We report a brief summary of the benchmark extracted from the TPC-E specification:
``The TPC-E benchmark uses a database to model a brokerage firm with customers who generate transactions related to trades, account inquiries, and market research. The brokerage firm in turn interacts with financial markets to execute orders on behalf of the customers and updates relevant account information. The benchmark is scalable, meaning that the number of customers defined for the brokerage firm can be varied to represent the workloads of different-size businesses. The benchmark defines the required mix of transactions the benchmark must maintain. The TPC-E metric is given in transactions per second E (tpsE). It specifically refers to the number of Trade-Result transactions the server can sustain over a period of time.
The TPC-E schema contains 33 tables with a total of 188 columns, 33 primary keys, 50 foreign keys, and 22 check constraints. The TPC-E workload is based on 10 different types of transactions.
The Dell DVD Store is an open source simulation of an online ecommerce site with implementations in Microsoft SQL Server, Oracle and MySQL along with driver programs and web applications http://linux.dell.com/dvdstore/
This section contains datasets derived from real applications.
Twitter Social Graph
We are currently testing using a (almost complete) snapshot of the twitter social graph (54M nodes, 1.9B relationships). We are testing several partitioning algorithms, since we do not currently have the actual twitter workload, and thus sampling and regular graph partitioning is challenging.
For Wikipedia, we have almost complete knowledge. We have the following starting data:
- the actual wikipedia data from http://dumps.wikimedia.org/ (we use a 2008 dump of about ~4.5TB of data once installed in MySQL)
- the complete application (mediawiki), from which we can derive queries and transactions (yes, wikipedia does use actual transactions!)
- a trace of 10% of all user requests issued to Wikipedia (in all languages) during the period between September 19th 2007 and January 2nd 2008. That we make available here: http://db.csail.mit.edu/wikipedia/
- live monitoring information from the wikipedia DB clusters: http://ganglia.wikimedia.org/?c=MySQL&m=cpu_report&r=hour&s=descending&hc=3&mc=3
- live profiling information from the mediawiki installation at wikipedia: http://noc.wikimedia.org/cgi-bin/report.py?db=all&sort=count&limit=5000
- information on the actual machine configuration used by the wikipedia team: http://noc.wikimedia.org/conf/
Given this information we were able to design a rather sophisticated benchmark. We can sample data and queries to produce smaller subsets (e.g., a 1% sample of the pages ~2M records). By inspecting several papers and blog posts, and interacting with the DB Administrator of Wikipedia we were able to get a rather precise guess of the impact of the various external caches, and we can thus generate a fairly precise mix of queries/transactions, as seen by the MySQL cluster in wikipedia. We are planning to release this benchmark, and the software to drive the test soon.
The epinions.com experiment aims to challenge our system in a case where it is hard to obtain a ``good partitioning. This verifies our effectiveness in discovering intrinsic correlations between data items that are not visible at the schema/query level. The epinions.com schema we consider contains four relations: users, items, reviews, trust, where the reviews relation represents an n-to-n relationship between users and items (capturing user reviews and ratings of items), and the trust relation represents a n-to-n relationship between pairs of users indicating a binary ``trust value. The data was obtained by Paolo Massa from the development team of epinions.com. Since no workload is provided, we derived one from the most popular functionality of the website as follows:
- For logged-in users: given an item provide ratings from the trusted users
- Given a user show the list of the users (s)he trusts
- For anonymous users: given an item provide the weighted average of all the ratings
- For anonymous users: given an item obtain the 10 most popular reviews
- List the 10 most popular reviews of an user
- Insert/Update the user profile
- Insert/Update the metadata of an item
- Insert/Update a review
- Update the trust relations between two users
Ensembl Genetic DB
The Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online.
Project homepage: http://www.ensembl.org
Schemas, integrity constraints and patches: http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl/sql/?root=ensembl
Interesting set of queries: http://svn.warelab.org/maize/ensembl/branches/release-3a.50/sql/
Online MySQL servers:
- Server Up to version 47 Version 48 onwards
- ensembldb.ensembl.org 3306 5306
- martdb.ensembl.org 3316 5316
Docuemntation of the schema and system: http://www.ensembl.org/info/docs/api/index.html
We collected a trace of over 10GB of the queries performed against the public online servers.