My lab partner and I are working on writing code to make our own SAT solver using Python for one of our courses. So far we have written this code to convert SoP to CNF. Now we are stuck as to how to convert the CNF to DIMACS format. We understand how the DIMACS format works when completing it by hand but we are stuck on writing the actually code to go from CNF to DIMACS. Everything we have found so far inputs files that are already in the DIMACS format.
from sympy.logic.boolalg import to_cnf
from sympy.abc import A, B, C, D
f = to_cnf(~(A | B) | D)
g = to_cnf(~A&B&C | ~D&A)
The sympy boolalg module lets you build an abstract syntax tree (AST) for the expression. In CNF form you'll have a top-level
And
node, with one or more children; each child is anOr
node with one or more literals; a literal is either aNot
node of a symbol, or just a symbol directly.From the DIMACS side, the top-level
And
is implicit. You are just listing theOr
nodes, and if a symbol was in aNot
you mark that with a-
before the symbol's variable. You are essentially merely assigning new names for the symbols and writing it down in a new text form. (The fact that DIMACS variable names look like integers is just because it's convenient; they do not have integer semantics/arithmetic/etc.)To track mapping between DIMACS variables and sympy symbols, something like this is helpful:
You can always ask for a new (fresh, never been used) variable with
new_variable
. DIMACS variables start from 1, not 0. (The 0 value is used to indicate not-a-variable, primarily for marking the end-of-clause.)We don't want to just allocate new variables every time, but also remember which variables were assigned to a symbol. This maintains a mapping from symbol to variable and vice versa. You hand a sympy symbol to
get_variable_for
and either the previously used variable for that symbol is returned, or a new variable is allocated and returned, with the mapping noted.It tracks the reverse mapping, so you can recover the original symbol given a variable in
get_symbol_for
; this is useful for turning a SAT assignment back into sympy assignments.Next, we need something to store this mapping along with the clause list. You need both to emit valid DIMACS, since the header line contains both the variable count (which the mapping knows) and the clause count (which the clause list knows). This is basically a glorified tuple, with a
__str__
that does the conversion to well-formed DIMACS text:Last, we just walk over the sympy AST to create DIMACS clauses:
This just descends down the tree, until it gets the root symbol and whether or not it was negated (i.e., was in a
Not
indicating negative polarity). Once the symbol is mapped to its variable, we can leave it positive or negate it to maintain polarity and append it to the DIMACS clause.Do this for all
Or
nodes and we have a mapped DIMACS formula.As an aside, you probably don't want to use
to_cnf
to get a CNF for purposes of testing satisfiability. In general, converting a boolean formula to an equivalent CNF can result in exponential size increase.Note in the above example for
f
, variableD
only appeared once in the formula yet appeared twice in the CNF. If it had been more complicated, like(C | D)
, then that entire subformula gets copied:If it was even more complicated, you can see how you end up with copies of copies of copies... and so on. For the purposes of testing satisfiability, we do not need an equivalent formula but merely an equisatisfiable one.
This is a formula that may not be equivalent, but is satisfiable if and only if the original was. It can have new clauses and different variables. This relaxation gives a linear sized translation instead.
To do this, rather than allow subformulas to be copied we will allocate a variable that represents the truth value of that subformula and use that instead. This is called a Tseitin transformation, and I go into more detail in this answer.
As a simple example, let's say we want to use a variable
x
to represent(a ∧ b)
. We would write this asx ≡ (a ∧ b)
, which can be done with three CNF clauses:(¬x ∨ a) ∧ (¬x ∨ b) ∧ (¬a ∨ ¬b ∨ x)
. Nowx
is true if and only if(a ∧ b)
is.This top-level function kicks off the process, so that the recursive calls share the same mapping and clause set. The final outcome is a single variable representing the truth value of the entire formula. We must force this to be true (else a SAT solver will simply choose any input variables to the formula, follow the implications, and produce an evaluated formula of any output).
The bulk of the translation is the code that adds clauses specific to the operation being performed. The recursion happens at the point where we demand a single variable that represents the output of the subformula arguments.
Now boolean formulas do not need to be in CNF to become DIMACS:
Even after unit propagation the resulting formula is clearly larger in this small example. However, the linear scaling helps substantially when 'real' formulas are converted. Furthermore, modern SAT solvers understand that formulas will be translated this way and perform in-processing tailored toward it.