Verifying halting-problem on self-implemented pseudo-assembly

247 Views Asked by At

I wrote a very simple implementation of what could be a similarity to Assembly/machine code.

It is even capable of recursion as in this example:

9 6
IFEQ R0,0
RET 1
ENDIF
MOV R1,R0
SUB R1,1
CALL R1
MOV R2,R9
MUL R2,R0
RET R2

Output: 720 (factorial of 6)

Description:

9 = Program Lines
6 = Program Input. Will be set to registry R0 value at class construction
CALL = calls the program again with the passed value (recursion)
RET = returns the program with the specified value. Sets registry R9 value to output value.

R0 to R9 -> general purpose registry.
R0 - program input value
R9 - program output value

-edit: Program commands: MOV, ADD, SUB, MUL, DIV, MOD, IFEQ, IFNEQ, IFG, IFGE, IFL, IFLE, ENDIF, CALL, RET

However the program can enter into infinite loop/recursion. e.g:

2 0
CALL 10
RET 1 //will never be reached

How do I verify whether MY program will enter into an infinite loop/recursion?

Here's my implementation, don't know whether it's necessary, but just in case you need. (It's the whole code... hope you don't mind).

#include <iostream>
#include <map>
#include <string> //std::getline
#include <sstream>
#include <vector>

namespace util
{
    template<typename I>I readcin(I& input) {
        std::cin >> input;
        std::cin.clear(); std::cin.ignore();
        return input;
    }
    template<typename I, typename...O> I readcin(I& input, O&... others) {
        readcin(input);
        return readcin(others...);
    }
}

//operations
enum OP
{
    MOV, ADD, SUB, MUL, DIV, MOD,
    IFG, IFL,
    IFEQ, IFGE, IFLE,
    IFNEQ,
    CALL,
    RET,
    ENDIF,
};
std::map<std::string, OP> OPTABLE
{
    {"MOV", MOV}, {"ADD", ADD}, {"SUB", SUB}, {"MUL", MUL}, {"DIV", DIV}, {"MOD", MOD},
    {"RET", RET},
    {"IFG", IFG}, {"IFL", IFL},
    {"IFNEQ", IFNEQ}, {"IFEQ", IFEQ}, {"IFGE", IFGE}, {"IFLE", IFLE},
    {"CALL", CALL},
    {"ENDIF", ENDIF}
};
//registry index
enum RI
{
    R0, R1, R2, R3, R4, R5, R6, R7, R8, R9, RI_MAX
};
std::map<std::string, RI> RITABLE =
{
    {"R0", R0}, {"R1", R1}, {"R2", R2}, {"R3", R3}, {"R4", R4}, {"R5", R5},
    {"R6", R6}, {"R7", R7}, {"R8", R8}, {"R9", R9}
};

struct Instruction
{
    OP op;
    RI r1;
    int r2value;
    Instruction() = delete;
    Instruction(OP operation, RI firstRegister, int _2ndRegValue = -1)
    {
        op = operation;
        r1 = firstRegister;
        r2value = _2ndRegValue;
    }
};


class Assembly
{

private:
    int REG[RI::RI_MAX] {0};
    int GetRegistryValue(RI ri) const { return REG[ri]; }
    void SetRegistryValue(RI ri, int val) { REG[ri] = val; }

    enum CMP_FLAG{ CMP_FAIL, CMP_OK };
    CMP_FLAG flag { CMP_OK };
    CMP_FLAG GetFlag() const { return flag; }
    void SetFlag(bool setFlag) { flag = static_cast<CMP_FLAG>(setFlag); }

    std::vector<std::string> programLines;

    OP ExtractOP(const std::string& line);
    RI ExtractRI(const std::string& line, OP op);
    int Extract2ndRIval(const std::string& line, OP op);
public:
    void addCommand(const std::string& line) { programLines.push_back(line); }
    void Execute();

    Assembly() = delete;
    Assembly(int inputValue) { REG[R0] = inputValue; }
    int ReturnValue() const { return REG[R9]; }
private:
    //recursion only
    Assembly(int inputValue, const std::vector<std::string>& progLines) {
        REG[R0] = inputValue;
        programLines = progLines;
        this->Execute();
    }
};

OP Assembly::ExtractOP(const std::string& line)
{
    std::istringstream issline{ line };
    std::string operation;
    issline >> operation;

    return OPTABLE[operation];
}

RI Assembly::ExtractRI(const std::string& line, OP op)
{
    auto space = line.find(' ');
    if(op <= IFNEQ){
        auto comma = line.find(',');
        return RITABLE[std::string(line.begin() + space + 1, line.begin() + comma)];
    }
    return RI_MAX;
}

int Assembly::Extract2ndRIval(const std::string& line, OP op)
{
    if(op == ENDIF) {
        return -1;
    }

    std::size_t spaceOrComma;
    if(op == CALL || op == RET) {
        spaceOrComma = line.find(' ');
    } else {
        spaceOrComma = line.find(',');
    }

    std::string opval = std::string(line.begin() + spaceOrComma + 1, line.end());
    auto it = RITABLE.find(opval);
    if(it != RITABLE.end()){
        return this->REG[it->second];
    }
    auto num = std::atoi(opval.c_str());
    return num;
}

void Assembly::Execute()
{
    for(const std::string& line : programLines)
    {
        OP op = ExtractOP(line);
        RI r1 = ExtractRI(line, op);
        int r2value = Extract2ndRIval(line, op);

        Instruction command ( op, r1, r2value );

        if(GetFlag() == CMP_FAIL)
        {
            if(command.op == ENDIF){
                SetFlag(CMP_OK);
            }
            continue;
        }

        switch(command.op)
        {
            case MOV: { SetRegistryValue(command.r1, command.r2value); } break;
            case ADD: { SetRegistryValue(command.r1, REG[command.r1] + command.r2value); } break;
            case SUB: { SetRegistryValue(command.r1, REG[command.r1] - command.r2value); } break;
            case MUL: { SetRegistryValue(command.r1, REG[command.r1] * command.r2value); } break;
            case DIV: { SetRegistryValue(command.r1, REG[command.r1] / command.r2value); } break;
            case MOD: { SetRegistryValue(command.r1, REG[command.r1] % command.r2value); } break;

            case IFEQ:  { SetFlag(GetRegistryValue(command.r1) == command.r2value); } break;
            case IFNEQ: { SetFlag(GetRegistryValue(command.r1) != command.r2value); } break;
            case IFG:   { SetFlag(GetRegistryValue(command.r1) > command.r2value); } break;
            case IFL:   { SetFlag(GetRegistryValue(command.r1) < command.r2value); } break;
            case IFGE:  { SetFlag(GetRegistryValue(command.r1) >= command.r2value); } break;
            case IFLE:  { SetFlag(GetRegistryValue(command.r1) <= command.r2value); } break;

            case RET:
            {
                SetRegistryValue(R9, command.r2value);
                return;
            }break;

            //oh boy!
            case CALL:
            {
               // std::cout << "value to call:" << command.r2value << std::endl;
                Assembly recursion(command.r2value, this->programLines);
                SetRegistryValue(R9, recursion.ReturnValue());
            }break;
        }
    }
}

int main()
{
    while(true)
    {
        int pl, input;
        util::readcin(pl, input);
        if(pl == 0){
            break;
        }

        Assembly Asm(input);
        for(auto i=0; i<pl; ++i)
        {
            std::string line;
            std::getline(std::cin, line);
            Asm.addCommand(line);
        }
        Asm.Execute();

        std::cout << Asm.ReturnValue() << std::endl;
    }
    return 0;
}
4

There are 4 best solutions below

3
Ross Ridge On BEST ANSWER

The only way to check to see if a program is stuck in an infinite loop in the general case is to check to see the program has entered the same state as previous state. If it has entered exactly the same state previously, then it must continue on executing in a loop returning to the same state over and over following the same sequence of steps. In real programs this essentially impossible because of the huge number of possible states the program can be in, but your assembly language only allows much more limited number of possible states.

Since your CALL instruction works just like invoking the program at the start and this is the only form of looping, this means that checking if the code enters the same state twice is simple. A CALL instruction with a certain argument has the exact same effect as invoking the program with that argument as an input. If the CALL instruction has the same argument as any previously executed CALL instruction or the program's input value, then it must continue executing in a loop endlessly returning to same state in the same sequence of steps.

In other words, the only state that needs to be checked is the R0 value at the start of the program. Since this value is stored in a int, it can only have 2^32 possible values on any common C++ implementation, so it's reasonable and easy to brute force check see if a given program with a given input gets stuck in infinite loop.

In fact, it's possible to use memoization of the return values to brute force check all possible inputs in O(N) time and O(N) space, where N is number of possible inputs. There are various ways you could do this, but the way I would do it is to create three vectors, all with a number of elements equal to the number of possible inputs. The first vector is a bool (bit) vector that records whether or not a given input has been memoized yet, the second vector is also bool vector and it records whether a given input has been used already on the call stack, the second vector is an int vector that records the result and is used a linked list of input values to create a call stack to save space. (In the code below these vectors are called, is_memoized, input_pending and memoized_value respectively.)

I'd take your interpreter loop and rewrite it to be non-recursive, something like this pseudo-code:

input = reg[R0]
if is_memoized[input]: 
    reg[R9] = memoized_value[input]
    return
input_pending[input] = true
memoized_value[input] = input  // mark the top of the stack

while true:
    for command in program:

        ...

        if command.op == CALL:
             argument = command.r2value

             if input_pending[argument]: 
                 // Since this input value is ready been used as input value 
                 // somewhere on the call stack this the program is about to enter
                 // an identical state as a previous state and so is stuck in
                 // a infinite loop.
                 return false  // program doesn't halt

             if is_memoized[argument]:
                 REG[R9] = memoized_value[argument]
             else:
                 memoized_value[argument] = input   // stack the input value

                 input = argument
                 REG = [input, 0, 0, 0, 0, 0, 0, 0, 0, 0]
                 input_pending[input] = true
                 break  // reevaluate the program from the beginning.

        if command.op == RET:
              argument = command.r2value
              stacked_input = memoized_value[input]
              input_pending[input] = false
              if stacked_input == input:  // at the top of the stack
                  REG[R9] = argument
                  return true   // program halts for this input

              is_memoized[input] = true
              memoized_value[input] = argument
              input = stacked_input
              REG = [input, 0, 0, 0, 0, 0, 0, 0, 0, 0] 
              break  // reevaluate the program from the beginning

You'd then call this interpreter loop for each possible input, something like this:

for i in all_possible_inputs:
     if not program.execute(input = i):  // the function written above
          print "program doesn't halt for all inputs"
          return
print "program halts for all inputs"

A recursive version should be faster since it doesn't have to reevaluate the program on each unmemoized CALL instruction in the program, but it would require hundreds of gigabytes of stack space in the worst case. This non-recursive version only requires 17 GB of memory. Either way it's still O(N) space and time, you're just making one constant factor smaller and another bigger.

To get this execute in reasonable amount of time you'd also probably want to parse the code once, and execute some byte code representation instead.

8
Brendan On

How do I verify whether a program will enter into an infinite loop/recursion?

In practice; the halting problem is trivial to solve. It's only impossible in theory.

The reason people think that halting problem is impossible to solve is that the question is constructed as a false dilemma ( https://en.wikipedia.org/wiki/False_dilemma ). Specifically; the question asks to determine if a program will always halt or will never halt; but there's a third possibility - sometimes halting (and sometimes not halting). For an example if this, imagine a program that asks the user if they want to halt forever or exit immediately (and correctly does what the user requested). Note that all sane applications are like this - they're not supposed to exit until/unless the user tells them to.

More correctly; in practice, there are 4 possibilities:

  • runs until something external causes it to halt (power turned off, hardware failure, kill -9, ...)
  • sometimes halts itself
  • always halts itself
  • indeterminate (unable to determine which of the 3 other cases it is)

Of course with these 4 possibilities, you can say you've created a "halting problem solver" just by classifying every program as "indeterminate", but it won't be a good solution. This gives us a kind of rating system - extremely good "halting problem solvers" rarely classify programs as "indeterminate", and extremely bad "halting problem solvers" frequently classify programs as "indeterminate".

So; how do you create a good "halting problem solver"? This involves 2 parts - generating control flow graphs ( https://en.wikipedia.org/wiki/Control-flow_graph ) for each function and a call graph ( https://en.wikipedia.org/wiki/Call_graph ) for the whole program; and value tracking.

Control Flow Graphs and Call Graph

It's not that hard to generate a control flow graph for each function/procedure/routine, just by examining the control flow instructions (call, return, jump, condition branch); and not that hard to generate a call graph while you're doing this (just by checking if a node is already in the call graph when you see a call instruction and adding it if it's not there yet).

While doing this, you want to mark control flow changes (in the control flow graph for each function) as "conditional" or "not conditional", and you want to mark functions (in the call graph for the whole program) as "conditional" or "not conditional".

By analyzing the resulting graphs you can classify trivial programs as "runs until something external causes it to halt" or "always halts itself" (e.g. this is enough to classify OP's original code as "runs until something external causes it to halt"); but the majority of programs will still be "indeterminate".

Value Tracking

Value tracking is (trying) to keep track of the possible values that could be in any register/variable/memory location at any point in time. For example, if a program reads 2 unsigned bytes from disk into 2 separate variable you know both variables will have a value from 0 to 255. If those variables are multiplied you know the result will be a value from 0*0 to 255*255; if those variables are added you know the result will be a value from 0+0 to 255+255; etc. Of course the variable's type gives absolute maximum possible ranges, even for assembly (where there's no types) - e.g. (without knowing if it's signed or unsigned) you know that a 32-bit read from memory will return a value from -2147483648 to 4294967296.

The point of value tracking is to annotate conditional branches in the control flow graph for each function; so that you can use those annotations to help classify a program as something other than "indeterminate".

This is where things get tricky - improving the quality of a "practical halting problem solver" requires increasing the sophistication/complexity of the value tracking. I don't know if it's possible to write a perfect "halting problem solver" (that never returns "indeterminate") but I do know that it's possible to write a "halting problem solver" that is sufficient for almost all practical purposes (that returns "indeterminate" so rarely that nobody cares).

7
Margaret Bloom On

If the program steps into the same configuration twice then it will loop forever. This is also true for Turing Machines, it is just that the (infinite) input is part of the machine's configuration.
This is also the intuition behind the pumping lemmas.

What constitutes a configuration in your model?
Since you have no memory and no IO, a configuration is given by the content of the registers and the line number of the current instruction (i.e. the instruction pointer).

When do a configuration change?
After every instruction, sure. But in the case of a non-branching instruction, the configurations before and after it are surely different because even if the instruction is a NOP then line number changed.
In the case of a branch, the line number might be one that we've seen before (for a backwards branch), so that's where the machine could step into the same configuration.

The only jumping instruction of interest, to me, seems to be call. The IF like ones will always produce different configurations because they are not expressive enough to produce iteration (jump back and repeat).
How does call change a configuration? It sets the line number to 1 and all the registers (except r0) to zero.
So the condition for a call to produce the same configuration reduces to having the same input.

If you check, in the call implementation, if the operand value has already been used before (in the simulation) then you can tell that the program will loop forever. If a register has size n, then the possible states are O(2n), which is generally a lot.
You must be prepared to give up after a (possible customizable) threshold. Or in your case where your registers are int, most C++ implementations have 32-bit int, and modern machines can handle a 512MiB bitmap of 2^32 bits. (std::vector<bool> implements a packed bitmap for example; index it with unsigned to avoid negative ints). A hash table is another alternative (std::unordered_set<int>). But if you used a wider type for your registers, the state would be too big to practically record every possible one and you would need some limit. A limit is kind of built-in to your implementation as you will overflow the C++ callstack (C++ recursion depth) before seeing anywhere near 2^32 repeats of the machine being simulated.

If the registers are unbounded in their value, this reduces to the Halting Problem and thus undecideable in the general case. (But as @Brendan says, you can still look for early repeats of the state; many programs will terminate or infinitely repeat in a simple way.)

If you change the call implementation to not zero out the other registers, you must fallback to check the whole configuration at the call site (and not just the operand).


To check the termination of the program on every input you must proceed non-deterministically and symbolically.
There are two problems: the branches and the input value.

It is a famous theorem that an NDTM can be simulated by a TM in an exponential number of steps w.r.t. the steps of the NDTM.
The only problematic instructions are the IF ones because they create non-determinism.
You can take several approaches:

  • Split the simulation in two branches. One that executes the IF one that does not.
  • Rewrite the code to be simulated to produce an exponential (in the number of branches) number of branch-free variants of the code. You can generate them lazily.
  • Keep a tree of configurations, each branch in the program generating two children in the current node in the tree.

They are all equivalent.

The input value is not known, so it's hard to tell if a call ends up in the same configuration. A possible approach is to record all the changes to the input register, for example you could end up with a description like "sub(add(add(R0, 1), 4), 5);".

This description should be easy to manipulate, as it's easy to see that in the example above R0 didn't change because you get "sub(add(R0, 5), 5);" and then "R0;".
This works by relying on the laws of arithmetics, you must take care of inverse operations, identity elements (1 * a = a) and overflow.
Basically, you need to simplify the expression.
You can then check if the input has changed at a given point in the simulated time.

0
RBJ On

I take it you're looking for outside-the-box thinking.

Think of the halting problem this way. Turing proved programs are free from control. But why? Because the language has instructions to control execution. This means feasibly regulating and predicting execution in programs requires removing all control semantics from the language.

Even my collaborative process architecture doesn't accomplish that. It just forbids them because of the mess they make. It is still composed from a language which contains them. For example, you can use IF to break, return or continue but not for other operations. Function calls are illegal. I created such restrictions to achieve controllability. However not even a collaborative organization removes control structures from the language to prevent their misuse.

My architecture is online via my profile with a factorial example in my W article.