Converting CNF format to DIMACS format

1.6k Views Asked by At

My lab partner and I are working on writing code to make our own SAT solver using Python for one of our courses. So far we have written this code to convert SoP to CNF. Now we are stuck as to how to convert the CNF to DIMACS format. We understand how the DIMACS format works when completing it by hand but we are stuck on writing the actually code to go from CNF to DIMACS. Everything we have found so far inputs files that are already in the DIMACS format.

    from sympy.logic.boolalg import to_cnf
    from sympy.abc import A, B, C, D
    f = to_cnf(~(A | B) | D)
    g = to_cnf(~A&B&C | ~D&A)
1

There are 1 best solutions below

5
On

The sympy boolalg module lets you build an abstract syntax tree (AST) for the expression. In CNF form you'll have a top-level And node, with one or more children; each child is an Or node with one or more literals; a literal is either a Not node of a symbol, or just a symbol directly.

From the DIMACS side, the top-level And is implicit. You are just listing the Or nodes, and if a symbol was in a Not you mark that with a - before the symbol's variable. You are essentially merely assigning new names for the symbols and writing it down in a new text form. (The fact that DIMACS variable names look like integers is just because it's convenient; they do not have integer semantics/arithmetic/etc.)

To track mapping between DIMACS variables and sympy symbols, something like this is helpful:

class DimacsMapping:
    def __init__(self):
        self._symbol_to_variable = {}
        self._variable_to_symbol = {}
        self._total_variables = 0

    @property
    def total_variables(self):
        return self._total_variables

    def new_variable(self):
        self._total_variables += 1
        return self._total_variables

    def get_variable_for(self, symbol):
        result = self._symbol_to_variable.get(symbol)
        if result is None:
            result = self.new_variable()
            self._symbol_to_variable[symbol] = result
            self._variable_to_symbol[result] = symbol

        return result

    def get_symbol_for(self, variable):
        return self._variable_to_symbol[variable]

    def __str__(self) -> str:
        return str(self._variable_to_symbol)

You can always ask for a new (fresh, never been used) variable with new_variable. DIMACS variables start from 1, not 0. (The 0 value is used to indicate not-a-variable, primarily for marking the end-of-clause.)

We don't want to just allocate new variables every time, but also remember which variables were assigned to a symbol. This maintains a mapping from symbol to variable and vice versa. You hand a sympy symbol to get_variable_for and either the previously used variable for that symbol is returned, or a new variable is allocated and returned, with the mapping noted.

It tracks the reverse mapping, so you can recover the original symbol given a variable in get_symbol_for; this is useful for turning a SAT assignment back into sympy assignments.

Next, we need something to store this mapping along with the clause list. You need both to emit valid DIMACS, since the header line contains both the variable count (which the mapping knows) and the clause count (which the clause list knows). This is basically a glorified tuple, with a __str__ that does the conversion to well-formed DIMACS text:

class DimacsFormula:
    def __init__(self, mapping, clauses):
        self._mapping = mapping
        self._clauses = clauses

    @property
    def mapping(self):
        return self._mapping

    @property
    def clauses(self):
        return self._clauses

    def __str__(self):
        header = f"p cnf {self._mapping.total_variables} {len(self._clauses)}"
        body = "\n".join(
            " ".join([str(literal) for literal in clause] + ["0"])
            for clause in self._clauses
        )

        return "\n".join([header, body])

Last, we just walk over the sympy AST to create DIMACS clauses:

from sympy.core.symbol import Symbol
from sympy.logic.boolalg import to_cnf, And, Or, Not

def to_dimacs_formula(sympy_cnf):
    dimacs_mapping = DimacsMapping()
    dimacs_clauses = []

    assert type(sympy_cnf) == And
    for sympy_clause in sympy_cnf.args:
        assert type(sympy_clause) == Or

        dimacs_clause = []
        for sympy_literal in sympy_clause.args:
            if type(sympy_literal) == Not:
                sympy_symbol, polarity = sympy_literal.args[0], -1
            elif type(sympy_literal) == Symbol:
                sympy_symbol, polarity = sympy_literal, 1
            else:
                raise AssertionError("invalid cnf")

            dimacs_variable = dimacs_mapping.get_variable_for(sympy_symbol)
            dimacs_literal = dimacs_variable * polarity
            dimacs_clause.append(dimacs_literal)

        dimacs_clauses.append(dimacs_clause)

    return DimacsFormula(dimacs_mapping, dimacs_clauses)

This just descends down the tree, until it gets the root symbol and whether or not it was negated (i.e., was in a Not indicating negative polarity). Once the symbol is mapped to its variable, we can leave it positive or negate it to maintain polarity and append it to the DIMACS clause.

Do this for all Or nodes and we have a mapped DIMACS formula.

f = to_cnf(~(A | B) | D)
print(f)
print()

f_dimacs = to_dimacs_formula(f)
print(f_dimacs)
print()
print(f_dimacs.mapping)
(D | ~A) & (D | ~B)

p cnf 3 2
1 -2 0
1 -3 0

{1: D, 2: A, 3: B}

As an aside, you probably don't want to use to_cnf to get a CNF for purposes of testing satisfiability. In general, converting a boolean formula to an equivalent CNF can result in exponential size increase.

Note in the above example for f, variable D only appeared once in the formula yet appeared twice in the CNF. If it had been more complicated, like (C | D), then that entire subformula gets copied:

f = to_cnf(~(A | B) | (C | D))
print(f)
(C | D | ~A) & (C | D | ~B)

If it was even more complicated, you can see how you end up with copies of copies of copies... and so on. For the purposes of testing satisfiability, we do not need an equivalent formula but merely an equisatisfiable one.

This is a formula that may not be equivalent, but is satisfiable if and only if the original was. It can have new clauses and different variables. This relaxation gives a linear sized translation instead.

To do this, rather than allow subformulas to be copied we will allocate a variable that represents the truth value of that subformula and use that instead. This is called a Tseitin transformation, and I go into more detail in this answer.

As a simple example, let's say we want to use a variable x to represent (a ∧ b). We would write this as x ≡ (a ∧ b), which can be done with three CNF clauses: (¬x ∨ a) ∧ (¬x ∨ b) ∧ (¬a ∨ ¬b ∨ x). Now x is true if and only if (a ∧ b) is.

This top-level function kicks off the process, so that the recursive calls share the same mapping and clause set. The final outcome is a single variable representing the truth value of the entire formula. We must force this to be true (else a SAT solver will simply choose any input variables to the formula, follow the implications, and produce an evaluated formula of any output).

def to_dimacs_tseitin(sympy_formula):
    dimacs_mapping = DimacsMapping()
    dimacs_clauses = []

    # Convert the formula, with this final variable representing the outcome
    # of the entire formula. Since we are stating this formula should evaluate
    # to true, this variable is appended as a unit clause stating such.
    formula_literal = _to_dimacs_tseitin_literal(
        sympy_formula, dimacs_mapping, dimacs_clauses
    )
    dimacs_clauses.append([formula_literal])

    return DimacsFormula(dimacs_mapping, dimacs_clauses)

The bulk of the translation is the code that adds clauses specific to the operation being performed. The recursion happens at the point where we demand a single variable that represents the output of the subformula arguments.

def _to_dimacs_tseitin_literal(sympy_formula, dimacs_mapping, dimacs_clauses):
    # Base case, the formula is just a symbol.
    if type(sympy_formula) == Symbol:
        return dimacs_mapping.get_variable_for(sympy_formula)

    # Otherwise, it is some operation on one or more subformulas. First we
    # need to get a literal representing the outcome of each of those.
    args_literals = [
        _to_dimacs_tseitin_literal(arg, dimacs_mapping, dimacs_clauses)
        for arg in sympy_formula.args
    ]

    # As a special case, we won't bother wasting a new variable for `Not`.
    if type(sympy_formula) == Not:
        return -args_literals[0]

    # General case requires a new variable and new clauses.
    result = dimacs_mapping.new_variable()

    if type(sympy_formula) == And:
        for arg_literal in args_literals:
            dimacs_clauses.append([-result, arg_literal])
        dimacs_clauses.append(
            [result] + [-arg_literal for arg_literal in args_literals]
        )
    elif type(sympy_formula) == Or:
        for arg_literal in args_literals:
            dimacs_clauses.append([result, -arg_literal])
        dimacs_clauses.append(
            [-result] + [arg_literal for arg_literal in args_literals]
        )
    else:
        # TODO: Handle all the other sympy operation types.
        raise NotImplementedError()

    return result

Now boolean formulas do not need to be in CNF to become DIMACS:

f = ~(A | B) | (C | D)
print(f)
print()

f_dimacs = to_dimacs_tseitin(f)
print(f_dimacs)
print()
print(f_dimacs.mapping)
C | D | ~(A | B)

p cnf 6 8
5 -3 0
5 -4 0
-5 3 4 0
6 -1 0
6 -2 0
6 5 0
-6 1 2 -5 0
6 0

{1: C, 2: D, 3: A, 4: B}

Even after unit propagation the resulting formula is clearly larger in this small example. However, the linear scaling helps substantially when 'real' formulas are converted. Furthermore, modern SAT solvers understand that formulas will be translated this way and perform in-processing tailored toward it.