IW Meeting 2011-04-15

From Inference Web

Meeting info

Attendees

  • Tim
  • Leo
  • Nick
  • Jitin

Agenda

Sources and Sinks

  • Tim
  • Leo

Discussion

Tim needs to capture the provenance of a triple store loading an RDF file from the web. pvload.sh is part of csv2rdf4lod and implements the current design. For today's discussion, we'll load Leo's FOAF file into LOGD's Virtuoso endpoint. The usage for pvload.sh is:

$ pvload.sh 
usage: pvload.sh [-n] url [-ng named_graph]
  -n  : dry run - do not download or load into named graph.
  url : the URL to retrieve and load into a named graph.
  -ng : the named graph to place 'url'. (if not provided, -ng == 'url').

The advantage of using pvload.sh over traditional loading methods is that the provenance of the load is stored in the named graph itself. So, for any query executed against a set of named graphs, one can also ask how the information got there in the first place instead of being forced to trust it blindly. Thus, we have named graphs that know where they came from. Part of this solution involves reusing the upcoming SPARQL 1.1 Service Description vocabulary to establish a canonical URI for a named graph within a particular SPARQL endpoint.
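Judging from the sd:NamedGraph URI in the provenance listing later in these notes, the canonical URI appears to be a percent-encoded SPARQL CONSTRUCT query against the endpoint, so that dereferencing the URI returns the endpoint's own description of that graph. The following is a minimal Python sketch of that construction, not pvload.sh's actual code; the function name and its parameters are hypothetical.

```python
from urllib.parse import quote

SD = "http://www.w3.org/ns/sparql-service-description#"

def named_graph_uri(query_service, endpoint_url, graph_name):
    """Build a URI naming `graph_name` as held by one SPARQL endpoint.

    query_service: URL that accepts ?query=... requests (assumed parameter).
    endpoint_url:  the sd:url value recorded for the endpoint inside the graph.
    """
    query = (
        f"PREFIX sd: <{SD}> "
        "CONSTRUCT { ?endpoints_named_graph ?p ?o } "
        f"WHERE {{ GRAPH <{graph_name}> {{ "
        f"[] sd:url <{endpoint_url}>; "
        "sd:defaultDatasetDescription [ sd:namedGraph ?endpoints_named_graph ] . "
        f"?endpoints_named_graph sd:name <{graph_name}>; ?p ?o . "
        "} }"
    )
    # safe='' percent-encodes ':' and '/' as well, as in the URI seen in the notes.
    return f"{query_service}?query={quote(query, safe='')}"

uri = named_graph_uri(
    "http://logd.tw.rpi.edu/sparql",
    "http://logd.tw.rpi.edu:8890/sparql",
    "http://www.cs.utep.edu/leonardo/foaf.rdf",
)
print(uri)
```

Because the URI depends on both the endpoint and the graph name, the same graph name loaded into two different endpoints gets two distinct URIs.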

Running the following command loads Leo's FOAF file into the LOGD endpoint:

[lebot@sam ~]$ pvload.sh http://www.cs.utep.edu/leonardo/foaf.rdf

---------------------------------- pvload ---------------------------------------

---------------------------------- pcurl ---------------------------------------
PCURL: url                http://www.cs.utep.edu/leonardo/foaf.rdf
PCURL: url basename       foaf.rdf
PCURL: -n localname       _pvload.sh1302898500_22771.response
PCURL: basename localname _pvload.sh1302898500_22771.response
getting last mod xsddatetime
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  1484    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
getting redirect name
200 OK (location: http://www.cs.utep.edu/leonardo/foaf.rdf)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  1484    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
PCURL: http redirect basename foaf.rdf
getting last mod
http://www.cs.utep.edu/leonardo/foaf.rdf (mod 2009-11-09T23:11:17)
http://www.cs.utep.edu/leonardo/foaf.rdf (mod 2009-11-09T23:11:17) to _pvload.sh1302898500_22771.response (@ 2011-04-15T16:15:02-04:00)
curl -L http://www.cs.utep.edu/leonardo/foaf.rdf > _pvload.sh1302898500_22771.response
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1484  100  1484    0     0  11164      0 --:--:-- --:--:-- --:--:--     0

--------------------------------------------------------------------------------
PVLOAD: url                http://www.cs.utep.edu/leonardo/foaf.rdf
PVLOAD: (URL) http://www.cs.utep.edu/leonardo/foaf.rdf -> (Named Graph) http://www.cs.utep.edu/leonardo/foaf.rdf
rapper: Parsing URI file:///home/lebot/_pvload.sh1302898500_22771.response with parser guess
rapper: Serializing with serializer ntriples
rapper: Guessed parser name 'rdfxml'
rapper: Parsing returned 23 triples

--------------------------------------------------------------------------------
sudo /opt/virtuoso/scripts/vload nt _pvload.sh1302898500_22771.response.nt http://www.cs.utep.edu/leonardo/foaf.rdf
Password: 
Loading finished! Check /tmp/virtuoso-tmp/vload.log for details.
sudo /opt/virtuoso/scripts/vload ttl _pvload.sh1302898500_22771.response.pml.ttl http://www.cs.utep.edu/leonardo/foaf.rdf
Loading finished! Check /tmp/virtuoso-tmp/vload.log for details.
sudo /opt/virtuoso/scripts/vload ttl _pvload.sh1302898500_22771.response.load.pml.ttl http://www.cs.utep.edu/leonardo/foaf.rdf
Loading finished! Check /tmp/virtuoso-tmp/vload.log for details.

To illustrate the PML provenance encoding, Tim hosted the intermediate files:

The problem with the current design is that the sd:NamedGraph is the NodeSet's conclusion, when we actually want to justify the information WITHIN the named graph. This becomes especially concerning when loading multiple files into the same named graph. According to PML, we'd have multiple justifications for the same Information, but in reality it is the union of the InferenceSteps that justifies the information within the named graph.

After a good discussion, we determined that we need to mint an Information by combining the named graph URI with a particular timestamp, so that each subsequent load justifies a different Information. Each of these Information instances can then be associated with the sd:NamedGraph if needed. This requires pvload to query the named graph it is GOING to populate, obtain the latest justification for its information, and reference that NodeSet in the provenance it includes when actually loading the subsequent RDF file.
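The revised design can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the SHA-1 keying of (named graph, timestamp), and the local names information_…, nodeSet_… and infStep_… are hypothetical and are not what pvload.sh emits. The predicate linking each Information to the sd:NamedGraph is left out because the discussion did not settle on one, and the inference rule is simply carried over from the current output.

```python
import hashlib

def load_justification(named_graph, source_url, loaded_at, previous_nodeset=None):
    """Return (nodeset_name, turtle) describing one load into `named_graph`."""
    # One key per (named graph, load time), so every load gets its own Information.
    key = hashlib.sha1(f"{named_graph} {loaded_at}".encode("utf-8")).hexdigest()
    information, nodeset, step = f"information_{key}", f"nodeSet_{key}", f"infStep_{key}"

    # The loaded file is always an antecedent; the previous load's NodeSet is
    # added when the named graph already held justified information.
    antecedents = [f"[ a pmlj:NodeSet; pmlj:hasConclusion <{source_url}>; ]"]
    if previous_nodeset is not None:
        antecedents.append(f"<{previous_nodeset}>")

    turtle = "\n".join([
        f"<{information}>",
        "   a pmlp:Information;",
        ".",
        "",
        f"<{nodeset}>",
        "   a pmlj:NodeSet;",
        f"   pmlj:hasConclusion <{information}>;",
        f"   pmlj:isConsequentOf <{step}>;",
        ".",
        "",
        f"<{step}>",
        "   a pmlj:InferenceStep;",
        f"   pmlj:hasAntecedentList ( {' '.join(antecedents)} );",
        "   pmlj:hasInferenceRule <http://inference-web.org/registry/MPR/TRIPLE_STORE_LOAD.owl#>;",
        f'   dcterms:date "{loaded_at}"^^xsd:dateTime;',
        ".",
    ])
    return nodeset, turtle

graph = "http://www.cs.utep.edu/leonardo/foaf.rdf"
first, first_ttl = load_justification(graph, graph, "2011-04-15T16:19:41-04:00")
# The second source URL and timestamp are made-up values for illustration.
second, second_ttl = load_justification(
    graph, "http://example.org/another-foaf.rdf", "2011-04-15T16:25:00-04:00",
    previous_nodeset=first,
)
print(second_ttl)
```

The only state the second load needs from the first is the name of its NodeSet, which keeps the amount of information carried between loads small.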

<http://logd.tw.rpi.edu/sparql?query=PREFIX%20sd%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2Fns%2Fsparql-service-description%23%3E%20CONSTRUCT%20%7B%20%3Fendpoints_named_graph%20%3Fp%20%3Fo%20%7D%20WHERE%20%7B%20GRAPH%20%3Chttp%3A%2F%2Fwww.cs.utep.edu%2Fleonardo%2Ffoaf.rdf%3E%20%7B%20%5B%5D%20sd%3Aurl%20%3Chttp%3A%2F%2Flogd.tw.rpi.edu%3A8890%2Fsparql%3E%3B%20sd%3AdefaultDatasetDescription%20%5B%20sd%3AnamedGraph%20%3Fendpoints_named_graph%20%5D%20.%20%3Fendpoints_named_graph%20sd%3Aname%20%3Chttp%3A%2F%2Fwww.cs.utep.edu%2Fleonardo%2Ffoaf.rdf%3E%3B%20%3Fp%20%3Fo%20.%20%7D%20%7D>
   a sd:NamedGraph;
   sd:name <http://www.cs.utep.edu/leonardo/foaf.rdf>;
.

<nodeSet_507059e1-e63d-47d7-abb4-5872f867d54a>
   a pmlj:NodeSet;
   pmlj:hasConclusion <http://logd.tw.rpi.edu/sparql?query=PREFIX%20sd%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2Fns%2Fsparql-service-description%23%3E%20CONSTRUCT%20%7B%20%3Fendpoints_named_graph%20%3Fp%20%3Fo%20%7D%20WHERE%20%7B%20GRAPH%20%3Chttp%3A%2F%2Fwww.cs.utep.edu%2Fleonardo%2Ffoaf.rdf%3E%20%7B%20%5B%5D%20sd%3Aurl%20%3Chttp%3A%2F%2Flogd.tw.rpi.edu%3A8890%2Fsparql%3E%3B%20sd%3AdefaultDatasetDescription%20%5B%20sd%3AnamedGraph%20%3Fendpoints_named_graph%20%5D%20.%20%3Fendpoints_named_graph%20sd%3Aname%20%3Chttp%3A%2F%2Fwww.cs.utep.edu%2Fleonardo%2Ffoaf.rdf%3E%3B%20%3Fp%20%3Fo%20.%20%7D%20%7D>;
   pmlj:isConsequentOf <infStep_507059e1-e63d-47d7-abb4-5872f867d54a>;
.

<infStep_507059e1-e63d-47d7-abb4-5872f867d54a>
   a pmlj:InferenceStep;
   pmlj:hasAntecedentList ( [ a pmlj:NodeSet; pmlj:hasConclusion <http://www.cs.utep.edu/leonardo/foaf.rdf>; ] );
   pmlj:hasInferenceRule <http://inference-web.org/registry/MPR/TRIPLE_STORE_LOAD.owl#>;
   oboro:has_agent          <http://tw.rpi.edu/web/inside/machine/sam#lebot>;
   hartigprov:involvedActor <http://tw.rpi.edu/web/inside/machine/sam#lebot>;
   dcterms:date "2011-04-15T16:19:41-04:00"^^xsd:dateTime;
.

A good follow-up demonstration for what we learned here is to load the FOAF files of Leo, Tim, Nick, and Jitin and show how subsequent loads into the same named graph are justified by annotating the previous PML-J justifications.

Raw discussion notes

different information at different times.


Leo: containers are references to where information is coming from.


CONSIDER: name the information for the subset of the named graph for JUST that one invocation of the load. then associate the three information instances into the named graph.


Leo: Source on SAWs. Container of Information can be a Person. Tim does not have full control over Person. e.g. person interacting with software. state of mind for person, they provide parameters - but don't know state of mind. in workflows, containers are places to load data to or get data from. (but NOT assuming you have full control over the container)


Tim has full control over the container. Leo considering situations where someone does NOT have full control. e.g. File system directory. part of process is to get a file_1 and file_2. - but don't know about other files available. capturing provenance of how file got into the system originally - need guarantee that nobody else is manipulating file system. cannot provide provenance for antecedents (files)


PRO / CON

full control

  • can justify everything that is included in the container


less control

  • open to reusing containers (those not part of specific process)
  • can use containers for other things.


Jitin: "what if a process is building a data artifact incrementally?"

Tim: we need to

Jitin: augment the list in pml-j's hasAntecedent list?

(rdf:lists are closed, rdf:containers are open)
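The parenthetical above can be made concrete with a small Python sketch that models triples as plain tuples. The helper names and blank node labels are hypothetical. Appending to an rdf:List means retracting the triple that points at rdf:nil, whereas appending to an rdf:Seq container only adds one membership triple.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def as_list(head, items):
    """Triples for a closed rdf:List over a non-empty `items`."""
    cells = [head] + [f"_:cell{i}" for i in range(1, len(items))]
    triples = []
    for i, item in enumerate(items):
        # The last cell is closed off with rdf:nil.
        rest = cells[i + 1] if i + 1 < len(items) else RDF + "nil"
        triples.append((cells[i], RDF + "first", item))
        triples.append((cells[i], RDF + "rest", rest))
    return triples

def as_seq(node, items):
    """Triples for an open rdf:Seq container using rdf:_1, rdf:_2, ..."""
    triples = [(node, RDF + "type", RDF + "Seq")]
    triples += [(node, f"{RDF}_{i}", item) for i, item in enumerate(items, start=1)]
    return triples

def diff(before, after):
    """Return (removed, added) triples when going from `before` to `after`."""
    before, after = set(before), set(after)
    return before - after, after - before

two, three = ["ex:loadA", "ex:loadB"], ["ex:loadA", "ex:loadB", "ex:loadC"]
list_removed, list_added = diff(as_list("_:list", two), as_list("_:list", three))
seq_removed, seq_added = diff(as_seq("_:seq", two), as_seq("_:seq", three))
print(len(list_removed), len(list_added))  # list: one triple retracted, three added
print(len(seq_removed), len(seq_added))    # seq: nothing retracted, one added
```

For an incrementally built artifact, this is the trade-off behind Jitin's question: extending a hasAntecedentList in place requires rewriting earlier assertions, while an open container does not.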


Tim design objective: minimizing the state needed when asserting subsequent loads.

do we need to capture the sequence of loads?

we are justifying the union dataset of the result, NOT the subset that corresponds to


<infStep_507059e1-e63d-47d7-abb4-5872f867d54a>
   a pmlj:InferenceStep;
   pmlj:hasAntecedentList ( [ a pmlj:NodeSet; pmlj:hasConclusion <http://www.cs.utep.edu/leonardo/foaf.rdf>; ] );
   pmlj:hasInferenceRule <http://inference-web.org/registry/MPR/TRIPLE_STORE_LOAD.owl#>;
   oboro:has_agent          <http://tw.rpi.edu/web/inside/machine/sam#lebot>;
   hartigprov:involvedActor <http://tw.rpi.edu/web/inside/machine/sam#lebot>;
   dcterms:date "2011-04-15T16:19:41-04:00"^^xsd:dateTime;
# TODO: Tim this should have pmlp:hasCreationDateTime
.


subsequent justifications incorporate provenance of previous

named graph is just the cohesive. named graph is actually three Informations.

TODO: Tim needs to timestamp the Information when naming it, then group the three Information instances into the named graph.

PML


DUH: just reference original nodeset (previous nodeset)
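Referencing the previous NodeSet presupposes a way to find it in the named graph before the next load. A hedged Python sketch of such a lookup follows; it assumes the provenance keeps using dcterms:date on each InferenceStep as in the listing above, that pmlj: expands to the PML-J 2.0 namespace written out below, and that the endpoint accepts GET requests with a query parameter. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed namespace bindings; the notes use the prefixes without declaring them.
PMLJ = "http://inference-web.org/2.0/pml-justification.owl#"
DCTERMS = "http://purl.org/dc/terms/"

def latest_nodeset_query(named_graph):
    """SPARQL query selecting the most recently dated NodeSet in `named_graph`."""
    return (
        f"PREFIX pmlj: <{PMLJ}>\n"
        f"PREFIX dcterms: <{DCTERMS}>\n"
        "SELECT ?nodeset WHERE {\n"
        f"  GRAPH <{named_graph}> {{\n"
        "    ?nodeset a pmlj:NodeSet; pmlj:isConsequentOf ?step .\n"
        "    ?step dcterms:date ?date .\n"
        "  }\n"
        "}\n"
        "ORDER BY DESC(?date) LIMIT 1"
    )

def request_url(endpoint, named_graph):
    """URL that would submit the lookup query to `endpoint` over HTTP GET."""
    return endpoint + "?" + urlencode({"query": latest_nodeset_query(named_graph)})

print(latest_nodeset_query("http://www.cs.utep.edu/leonardo/foaf.rdf"))
print(request_url("http://logd.tw.rpi.edu/sparql",
                  "http://www.cs.utep.edu/leonardo/foaf.rdf"))
```

An empty result would mean the named graph has no prior justification, in which case the new load has only the source file as its antecedent.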

TODO: Tim create DeclarativeRule: RDF UNION in IW Registry



Next week:

3 PM EST, Skype.
