Demo: Combine and Improve TPTP Proofs

From Inference Web

Jump to: navigation, search


warning.pngEmpty strings are not accepted. ({{#vardefine:ns| }}{{#vardefine:p|type }}{{#vardefine:v|Demo }}) [ Edit ] {{#vardefine:ns| }}{{#vardefine:p|homepage }}{{#vardefine:v| }} {{#vardefine:ns| }}{{#vardefine:p|description }}{{#vardefine:v|combine and improve TPTP proofs }} {{#vardefine:ns| }}{{#vardefine:p|location }}{{#vardefine:v| }} {{#vardefine:ns| }}{{#vardefine:p|relation }}{{#vardefine:v|TPTP-IW }} {{#vardefine:ns| }}{{#vardefine:p|tag }}{{#vardefine:v| }} {{#vardefine:ns| }}{{#vardefine:p|alias }}{{#vardefine:v| }} {{#vardefine:ns| }}{{#vardefine:p|equal_to }}{{#vardefine:v| }}



In many real world settings, different solutions are provided for the same problem. Users sometimes need to evaluate the answer and the justification for the answer. They may want to have a presentations of the justification that vary according to understandability and dependencies. Our research focuses on providing justifications for answers from intelligent systems and also for providing different ways of pruning and presenting those justifications.

This page focuses on ways of providing and manipulating justifications with the goal of providing users with tools for accessing and understanding those justifications. Our work has used many different kinds of reasoning systems. In the rest of this page, we focus on a certain kind of reasoning system - theorem provers. The work however can be and has been applied in other kinds of systems including information extraction and learning systems.

Different reasoning systems often find different solutions to the same problem. In a formal sense, they find different proofs for the same theorem. This work examines two proofs by combining subproofs that have the same (intermediate) conclusion, and then uses the "better" subproof to replace the other "worse" subproof. As a result, the proof containing the "worse" subproof is improved. There are many possible metrics for evaluating (sub)proof quality, e.g., number of inferences, number of axioms used, complexity of terms in the proof, etc.

Thousands of Problems for Theorem Provers

TPTP Dataset

This demo used the TPTP problem library - a library of test problems for first-order ATP systems. The proofs are taken from the TSTP solution library - a library of ATP systems' solutions to TPTP problems.

  • TPTP - Thousands of Problems for Theorem Provers
  • TSTP - Thousands of Solutions for Theorem Provers
  • ATP - Automated Theorem Proving

The "Agatha" problem

Throughout this demo, we use the "agatha" problem that was given by Len Schubert [1]. For ATPs with an ‘answer extraction’ mechanism, the conclusion might replaced with the query

‘who killed Aunt Agatha?’


(1) Someone who lives in Dreadsbury Mansion killed Aunt Agatha.
(2) Agatha, the butler, and Charles live in Dreadsbury Mansion, and are the only people who live therein.
(3) A killer always hates his victim, and is never richer than his victim.
(4) Charles hates no one that Aunt Agatha hates.
(5) Agatha hates everyone except the butler.
(6) The butler hates everyone not richer than Aunt Agatha.
(7) The butler hates everyone Agatha hates.
(8) No one hates everyone.
(9) Agatha is not the butler.

Expected Answer

Agatha killed herself.

[1] Pelletier, F. J. 1986. Seventy-five problems for testing automatic theorem provers. J. Autom. Reason. 2, 2 (Jun. 1986), 191-216. DOI=

Example: agatha (ep-otter)

We start with a simple example which combined two proofs and used a simple proof quality metric, i.e. counting the number of inference steps in the proof including the axiom steps as well as intermediate steps.

Step A. Choose two original proofs

The chosen TPTP problem is the clause normal form problem PUZ001-1, regarding the murder of Aunt Agatha in Dreadbury Manision, where she, Charles, and the butler live together. The problem provides 11 axioms to ATP systems such as

cnf(killer_hates_victim,hypothesis, ( ~ killed(X,Y) | hates(X,Y) )). -- (3)
cnf(agatha_hates_charles,hypothesis, ( hates(agatha,charles) )). --(5)

Moreover, the problem also used a negated conjecture that denies that Charles or the butler murdered Aunt Agatha, which proves that Aunt Agatha killed herself ...

cnf(prove_neither_charles_nor_butler_did_it,negated_conjecture, ( killed(butler,agatha) | killed(charles,agatha) )).

We choose two proofs for this problem by two ATP systems (by refutation) that use different calculi and different inference rules, resulting in different proofs.

  • EP - it uses fine grained inference rules such as rewriting, simplification, clause normalization, simultaneous paramodulation, and superposition.
  • Otter - it uses more coarse grained rules such as hyperresolution.

Legend for SVG (DAG view)

  • diamond denotes an inference step
    • the name of inference rule is printed as node label
    • each diamond is click-able, linking to iwbrowser-local
    • the "lightgrey" fill-color denotes a leaf (e.g. axiom).
    • the border color identifies the inference engine used:
Ayane -> green EP -> yellow Faust -> brown
Metis -> red Otter -> purple SNARK -> cyan
SOS -> blue Vampire -> grey
  • rectangle denotes a formula
    • the content of it is printed as node label
    • the "grey" border-color shows it is using TPTP-FOF language
    • the "green" text-color shows it is new from the original
    • the "red" fill-color shows it has alternative derivations
  • arc denotes the derivation dependencies among nodes
    • an arc from a rectangle to a diamond shows the formula was derived by the step
    • an arc from a diamond to a rectangle shows the step used the formula as antecedent

The proofs generated by the two ATP systems form DAGs as show below.

The original EP proof The original Otter proof
steps 24 17
axioms 10 10
unique formula 24 17
unique rules 7 2

The original EP proof

SVG iwbrowser Probe-It! IDV
Original EP proof (SVG) plotted by Dot

Original EP plotted by iwbrowser

Original EP plotted by Probe-It!

Original EP plotted by IDV

The original Otter proof

SVG iwbrowser Probe-It! IDV

Original Metis proof (SVG) plotted by Dot

Original Otter proof plotted by iwbrowser

Original EP plotted by Probe-It!

Original EP plotted by IDV

Step B Compare Proof Nodes

Semantically equivalent formula can be found within a solution or across different solutions towards the same problem. The semantic equivalence relation in this work were derived by a simple first order logic formula normalization tool. The table below lists some examples of equivalent formula after normalization.

normalized original 1 original 2
hates(butler, butler) hates(butler, butler) (EP) hates(butler, butler) (Otter)
$false $false|$false (EP) $false (EP)
~ killed(X001,X002) | ~ richer(X001,X002) ~ killed(X1,X2) | ~ richer(X1,X2) (EP) ~ killed(A,B) | ~richer(A,B) (Otter)

With the the semantic equivalence relations, we can combine subproofs within a proof or from many different proofs.

Step C Self Improvement

We first show a simple improvement within a proof. Some reasoners, such as EP and Metis, does keep semantically equivalent intermediate conclusions in proofs to preserve reasoning details such as formula normalization or rewrite. In what follows, we show an example from the EP proof (presented in step A) that ends with a sequence of $false conclusions, completing the refutation.

The following table shows that there are three different subproofs in the EP proof deriving the same conclusions. While both #subproof-A and #subproof-B lead to the $false conclusion, subproof-B, being a subproof of subproof-A, is clearly better. Thus the EP proof can be "self-improved" by eliminating two inference steps. by replacing the larger subproof by the smaller subproof.

EP's subproof-A (the original) has 24 steps , ending with a sequence of $false conclusions.

subproof-A (roginal proof)

EP's subproof-B (subproof of the original) has 22 steps , ending with one $false|$false conclusion.

subproof-B (better proof)

After combine, two steps were removed

SVG iwbrowser Probe-It!

Original EP plotted by Probe-It!

Step D Global Improvement

The second pair of subproofs have the conclusion hates(butler, butler). The EP's #subproof-C has 14 nodes, while the Otter's #subproof-D has 11 nodes. The Oter proof is smaller thanks to the use of hyper-resolution inference steps that effectively combine multiple binary resolution steps done in the EP proof. Additionally the two proofs have quite different structures, e.g., the EP proof initially concludes killed(butler, agatha) and proceeds from there to the conclusion hates(butler, butler), while the Otter proof never produces (either explicitly, or implicitly as an intermediate step of hyper-resolution) the conclusion killed(butler, agatha). These are shown in the IWBrowser presentations below. Thus the EP proof can be improved by "combining" it with the Otter proof, by replacing the EP's subproof-C with Otter's subproof-D.

The EP's subproof-C has 14 nodes, ending with the conclusion hates (butler, butler)

The Otter's subproof-D has 11 nodes, ending with the conclusion hates (butler, butler).

The table below shows the result of improvement and how the original EP proof was improved:

SVG iwbrowser IDV

Example: agatha (EP and Faust)

In this example, we show how two proof may help each other, and we can further derive a complete new proof which is better than any of the original proof.

Example: agatha (Evolution of EP)

In this example, we show we can compare different versions of proofs generated by the same inference engine - EP. We preserved four different versions of EP each of which generated different proofs

  • EP 0.999
  • EP 1.0
  • EP 1.1pre
  • EP 1.1

We found that EP 1.1pre was helped by EP 1.0, why?


Using More Quality Metrics

We now are counting just on the least number of steps in the derivation, and we can extend the measure to a broader scope

  • find me a proof that uses some new statements outside the original proofs
  • find me a proof that uses new axioms(sources)
  • find me two proofs that looks very different (using different sets of axioms or statements)

Current and Potential Applications

The example above demonstrates the basic proof combining technique that is possible thanks to the PML encoding of the reasoning. Extensions, efficient implementation, and applications are the focus of current research. Application of proof combining in various domains are of particular interest. One immediate application being actively pursued is combining proofs taken from the Mizar Mathematical Library (MML) of formalized mathematics. The theorems in the MML have been exported to TPTP format in two ways (work by Josef Urban). First, the theorems in the MML have been used to create TPTP problems that can be solved by ATP systems. ATP systems' proofs of these are being combined. In this setting it is possible to combine proofs of the same theorem by different ATP systems, and also proofs of different theorems, thanks to the common axiomatic background. This provides exciting opportunities for finding better proofs. Second, the human proofs in the MML have been exported as TSTP proofs. These too can be combined, with the potential for improving the original human proofs.

In the future, possible applications include combining proofs (both formal and informal) in areas such as integrated learning systems, scientific workflow systems, science data integration systems, policy-aware systems, intelligence analysis using hybrid logic-based systems, and task processing systems.


Facts about Demo: Combine and Improve TPTP ProofsRDF feed
Namewarning.pngEmpty strings are not accepted.
Personal tools