How To Quickly Set Up Virtuoso + DBpedia

DBpedia is a Linked Data project with the goal to extract structured content from Wikipedia and make it available in a Knowledge Base that, for the English language, currently describes more than 4.0 million things, out of which 3.22 million are classified in a consistent ontology, including 832,000 persons, 639,000 places (including 427,000 populated places), 372,000 creative works (including 116,000 music albums, 78,000 films and 18,500 video games), 209,000 organizations (including 49,000 companies and 45,000 educational institutions), 226,000 species and 5,600 diseases. [1][2] In a few words, a lot of interesting data, ready to be looked at!

OpenLink Virtuoso Open-Source Edition is an enterprise grade multi-model data server for agile enterprises and individuals [3]. In particular, it is widely adopted as a triple-store with a SPARQL endpoint.

In order to run interesting experiments on the DBpedia Knowledge Base, one may want to set up a server locally and query the KB much as one would through the online services offered by DBpedia.

Since both Janez Starc and I felt one could get lost in the extensive online documentation, we decided to write a recap of the key steps required to set up Virtuoso + DBpedia.

Before starting, it is worth mentioning that there are two versions of Virtuoso: a commercial version and an open-source edition. The commercial version has a free evaluation period (15 days); the open-source version is distributed under the GPLv2 license, plus a couple of extra provisions. The commercial edition comes with some extra tools (e.g., data integration), but the open-source edition seems to have all the features necessary for dealing with RDF and OWL triples, including a SPARQL endpoint. A comparison of the two versions is available here.

So let's start!

[1]: http://dbpedia.org/About
[2]: https://en.wikipedia.org/wiki/DBpedia
[3]: http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/

Installation

First, we need to install the software on the local machine:

Using a Package Manager

If the machine is running a reasonably maintained Linux distribution, there will most likely be a package that can be installed with the package manager of choice (we could verify Debian, Ubuntu, and RHEL 6). On Mac OS X, Homebrew makes the installation just as easy.

Building from Source

Virtuoso supports a classic configure; make; make install installation. We recommend specifying the following options during the configure phase:

--with-readline       # use the system-provided readline library 
--enable-isparql-vad  # enable a SPARQL web interface at http://localhost:8890/sparql

Other configuration options are available by running ./configure --help.

Server Configuration

Next, customize the virtuoso.ini file for the environment of the local machine so that it will perform well when loading and managing large quantities of data that are typical of knowledge bases like DBpedia, Yago, Freebase, etc. A description of all the parameters is available here, while this is the original document illustrating some recommended settings for configuring Virtuoso at scale.

Watch out: in some instances of Virtuoso the parameter names must be at the beginning of the line (no leading whitespace), otherwise they may be ignored.

[Database] Section

Then, specify the paths to the knowledge base files, making sure to select directories that have the appropriate permissions and that sit on volumes with enough free storage:

[Database]
DatabaseFile
ErrorLogFile
LockFile
TransactionFile
xa_persistent_file

Also, specify from which directories Virtuoso is allowed to read in order to load data sets (in a stock virtuoso.ini this option sits in the [Parameters] section):

DirsAllowed = /path/to/1, /path/to/2, ...

[Parameters] Section

Now, specify the amount of RAM (in 8K pages) used by Virtuoso to cache database files and the maximum number of modified buffers to store before writing.

;; Note the default settings will take very little memory
;; but will not result in very good performance
;;
NumberOfBuffers          = 10000
MaxDirtyBuffers          = 6000

Note that both properties are used to specify the amount of system memory that will be dedicated to Virtuoso. The instructions on how to set these parameters are in the .ini file. The amount of dedicated system memory can be less than the size of the database file, virtuoso.db, which contains all the data and indices. If the amount of dedicated memory is several times lower than the size of the db file, loading and querying performance will probably degrade, especially if you have slow disk I/O. A good proxy for estimating the size of virtuoso.db is the size of your uncompressed dataset files stored in the N-Triples RDF format.
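As a rough illustration of how these two values relate to the available RAM, here is a minimal sketch. It assumes a common rule of thumb that is not stated in this article: dedicate about two thirds of RAM to 8K buffers and let about three quarters of them be dirty. The function name, the fractions, and the 8 GB example machine are all assumptions; the comments in your own virtuoso.ini remain the authoritative guide.

```python
PAGE_SIZE = 8 * 1024  # Virtuoso caches the database in 8K pages


def suggest_buffers(ram_bytes, ram_fraction=2 / 3, dirty_fraction=0.75):
    """Return (NumberOfBuffers, MaxDirtyBuffers) for a given amount of RAM.

    ram_fraction and dirty_fraction are assumed rules of thumb,
    not values taken from this article.
    """
    number_of_buffers = int(ram_bytes * ram_fraction) // PAGE_SIZE
    max_dirty_buffers = int(number_of_buffers * dirty_fraction)
    return number_of_buffers, max_dirty_buffers


if __name__ == "__main__":
    # Hypothetical machine with 8 GB of RAM
    buffers, dirty = suggest_buffers(8 * 1024**3)
    print(f"NumberOfBuffers          = {buffers}")
    print(f"MaxDirtyBuffers          = {dirty}")
```

The two printed lines can be pasted into the [Parameters] section. If other services share the machine, a smaller ram_fraction is probably the safer choice.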

Next, specify the number of 8K pages by which the database grows in size when it reaches capacity.

FileExtend = 200 ; My default is 200

Note that increasing the value will increase the speed of loading.
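To get a feeling for why a larger value helps, the sketch below counts how many times the database file has to be extended before it reaches a given size. It assumes the file starts empty and grows only in FileExtend-sized steps; the 50 GB target and the function name are hypothetical.

```python
import math

PAGE_SIZE = 8 * 1024  # bytes per database page


def extensions_needed(target_db_bytes, file_extend_pages):
    """Number of times the file must grow to reach the target size."""
    step_bytes = file_extend_pages * PAGE_SIZE
    return math.ceil(target_db_bytes / step_bytes)


if __name__ == "__main__":
    target = 50 * 1024**3  # hypothetical 50 GB database
    for pages in (200, 20000):
        count = extensions_needed(target, pages)
        print(f"FileExtend = {pages}: {count} extensions")
```

With the default of 200 pages each step adds only about 1.6 MB, so a large load triggers many thousands of extensions.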

Once the .ini file is prepared you can start the Virtuoso server with the command virtuoso-t. The server can be controlled from the command line with the command isql-vt or from a web browser at http://localhost:8890/conductor if you use the default server parameters.

Loading DBpedia

DBpedia is composed of different data sets. The most basic ones include the DBpedia Ontology (definitions of types and properties), the Mapping-based Types (types of DBpedia resources), and the Mapping-based Resources (properties of DBpedia resources). These sets should be sufficient to get started and to run some interesting experiments.

Other data sets are also available, such as the Raw Infobox Properties, but keep in mind that they may be less curated than the Mapping-based properties. There are also data sets for languages other than English, and files containing links to other knowledge bases, like Yago, OpenCyc, Freebase, the CIA World Factbook, etc. In particular, these links effectively make DBpedia a powerful Linked Data hub.

All the data sets are available in these formats:

.nt  - N-Triples
.nq  - N-Quads 
.ttl - Turtle

and they can be downloaded from http://wiki.dbpedia.org/Downloads.

Virtuoso directly supports all of the above formats. Note that files in the N-Quads format, .nq, are larger in size due to the provenance information stored as the fourth element (the other formats represent only triples).

Once the data sets of interest are downloaded, they must be decompressed. For simplicity, we recommend putting them all into one directory. Since OpenLink's extensive documentation about Virtuoso can be hard to navigate, here is the direct link to the bulk RDF loading tutorial.
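The decompression step can be scripted. The sketch below assumes the downloaded files are bzip2-compressed (a .bz2 suffix) and that they all sit in one directory; the function name is made up for this example.

```python
import bz2
import shutil
from pathlib import Path


def decompress_all(directory):
    """Decompress every *.bz2 file in `directory`, keeping the originals."""
    extracted = []
    for archive in sorted(Path(directory).glob("*.bz2")):
        target = archive.with_suffix("")  # e.g. foo.nt.bz2 -> foo.nt
        with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
            # Copy in chunks so large dumps never have to fit in memory
            shutil.copyfileobj(src, dst)
        extracted.append(target)
    return extracted
```

Calling decompress_all on the download directory returns the paths of the extracted files, which is handy for a quick check of their total size.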

We will need to assign a graph IRI. Think of it as a namespace. According to the tutorial, the IRI string can be specified in three ways: directly in the ld_dir function; in a .graph file for each data set; or in a global.graph file, which applies to any data set that does not have an associated graph file. Keep in mind that having separate graph IRIs for each data set requires querying each data set separately using the SPARQL FROM clause.

If other data sets are stored together with DBpedia, we recommend specifying one graph IRI (e.g., http://your.organization.org/dbpedia) for all of the DBpedia data sets. Using the global.graph file is an easy way to accomplish that.
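Creating the global.graph file takes a single line of text. A minimal sketch, assuming the file only needs to contain the graph IRI and lives in the same directory as the data sets (the function name is hypothetical):

```python
from pathlib import Path


def write_global_graph(data_dir, graph_iri):
    """Create data_dir/global.graph containing the graph IRI on one line."""
    graph_file = Path(data_dir) / "global.graph"
    graph_file.write_text(graph_iri + "\n", encoding="utf-8")
    return graph_file
```

For example, write_global_graph("/path/to/1", "http://your.organization.org/dbpedia") would cover every data set in that directory that has no .graph file of its own.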

Bulk loading supports all common RDF data formats (even gzip'd ones!). Make sure that all the directories from which Virtuoso loads data are listed in the DirsAllowed option in virtuoso.ini.

If global.graph is used to specify the IRI(s) for your graph(s), the data can be loaded in bulk from isql. Running the command:

select * from DB.DBA.load_list;

from within isql, gives a hint on the progress of the loading operation.
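The loading step itself, as we understand it from the bulk loading tutorial, boils down to registering the files with ld_dir, running the loader, and making the result durable with a checkpoint. The sketch below only builds those statements as text so they can be reviewed before being pasted into isql. The helper name and the example arguments are assumptions, and the graph IRI passed to ld_dir is the fallback for files without a .graph or global.graph entry.

```python
def bulk_load_script(data_dir, file_mask, graph_iri):
    """Return the isql statements that register and load a directory."""

    def quote(value):
        # SQL string literal: single quotes, embedded quotes doubled
        return "'" + value.replace("'", "''") + "'"

    return "\n".join([
        f"ld_dir({quote(data_dir)}, {quote(file_mask)}, {quote(graph_iri)});",
        "rdf_loader_run();",
        "checkpoint;",
    ])


if __name__ == "__main__":
    # Hypothetical directory; it must also be listed in DirsAllowed
    print(bulk_load_script(
        "/path/to/1", "*.nt", "http://your.organization.org/dbpedia"))
```

While rdf_loader_run() is working, the load_list query above can be issued from a second isql session to follow the progress.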

Useful isql Commands

OpenLink's documentation on some of the functions available through isql is available here. We report some of the most useful ones:

SQL> status(); # See if the server is running and which commands are running
SQL> RDF_GLOBAL_RESET (); # Empty the database/remove all triples (!!!).
SQL> txn_killall (1); # Kill all the previously issued transactions that are still running.
SQL> shutdown(); # Stop Virtuoso.

isql can also be used to issue SPARQL queries on the triples loaded in the database.

For example:

SQL> SPARQL SELECT * WHERE {?s ?p ?o} LIMIT 10; # A simple SPARQL query
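The same query can also be sent over HTTP to the SPARQL endpoint. A minimal sketch using only the Python standard library, assuming the server runs with the default parameters so that the endpoint is http://localhost:8890/sparql, and that it honours a request for JSON results; both function names are made up for this example.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:8890/sparql"  # default local endpoint (assumed)


def build_query_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as the URL of a GET request."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return the list of result bindings."""
    request = urllib.request.Request(
        build_query_url(query, endpoint),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


if __name__ == "__main__":
    # Requires a running local Virtuoso server
    for row in run_query("SELECT * WHERE {?s ?p ?o} LIMIT 10"):
        print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```

Keeping the URL construction separate from the network call makes it easy to inspect the request before sending it.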

Feedback

Please let us know if this document helped you (or not!) in any way, and feel free to share any feedback you may have by commenting below, or by email.
