I'm doing some work where I need to be able to describe modifications to some program code that are to be done automatically.
Is there a language that allows one to describe this?
The language should have modules or functions that receive the location in the code where the modification is to be done and should allow specifying the possible modifications to be done.
It should allow describing modifications such as removing a given function, adding an if condition around a piece of code, adding a new function declaration that does nothing, etc.
The modifications should be done over the parse tree so it is possible to restore the original code, only with the modifications.
I don't even need the language to have a parser or an implementation associated, all I need is the description of the language itself, either as a BNF grammar or even informally.
I know that phc, the PHP ahead-of-time compiler, is able to transform the source code into an XML representation and back, making it easier to modify the code and restore it. What I need is a way to describe the actual modifications to the XML so that I can run a program that can, for example, remove all instances of a specific function call, or add if(false) around each. Also, it would be better if the language was language-agnostic, although it's not a requirement.
Do you think something like this exists?
The key idea is program transformations. Ondrej has the right idea with DMS but I'm the author of DMS so I'm likely biased.
The DMS language used to accomplish transformations is called the "(DMS) Rule Specification Language", or RSL, and is used to specify (program transformation) rules. Such a rule has:
The patterns are often written in the surface syntax of the target language, that is, the native syntax of the language being transformed, with extensions for pattern variables. To distinguish the RSL syntax from the target language, patterns are written inside (meta)quotes "...". The \ character inside patterns is a (meta)escape back into RSL. A pattern variable is written "\x". A (meta)function foobar is written as \foobar( ... ); note the (meta)escape on the (meta)function's arguments. Outside of the quotes, the meta-escapes are not needed and these constructs are written without \, e.g., foobar(...).
DMS rules can be a lot more complex than this, but these are the basics. The surface-syntax patterns do not represent text; rather, they represent the equivalent ASTs of the code in the patterns. DMS rules are used to match and change ASTs. The program transformation system of course has to have parsers to produce ASTs, and anti-parsers ("prettyprinters") to convert ASTs back to text. (DMS has a big library of language front ends for all the widely used languages on the planet and a lot of the uncommon ones; we just added MUMPS.)
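To make the "patterns are really ASTs" point concrete, here is a minimal sketch in Python using the standard `ast` module (Python 3.9+ for `ast.unparse`). It is an analogy only, not RSL and not how DMS is implemented. The convention that names beginning with `_` act as pattern variables, and the names `match`, `Substitute`, `ApplyRule` and `rewrite`, are assumptions made up for this illustration.

```python
import ast
import copy

def match(pattern, node, bindings):
    """Structurally match two ASTs; names starting with '_' are pattern variables."""
    if isinstance(pattern, ast.Name) and pattern.id.startswith("_"):
        if pattern.id not in bindings:
            bindings[pattern.id] = node            # first occurrence: bind it
            return True
        # repeated variable: both occurrences must be the same subtree
        return ast.dump(bindings[pattern.id]) == ast.dump(node)
    if type(pattern) is not type(node):
        return False
    for field in pattern._fields:
        p, n = getattr(pattern, field, None), getattr(node, field, None)
        p_items = p if isinstance(p, list) else [p]
        n_items = n if isinstance(n, list) else [n]
        if len(p_items) != len(n_items):
            return False
        for a, b in zip(p_items, n_items):
            if isinstance(a, ast.AST):
                if not isinstance(b, ast.AST) or not match(a, b, bindings):
                    return False
            elif type(a) is not type(b) or a != b:
                return False
    return True

class Substitute(ast.NodeTransformer):
    """Replace pattern variables in a right-hand side by their bound subtrees."""
    def __init__(self, bindings):
        self.bindings = bindings
    def visit_Name(self, node):
        if node.id in self.bindings:
            return copy.deepcopy(self.bindings[node.id])
        return node

class ApplyRule(ast.NodeTransformer):
    """Apply one rule lhs -> rhs to every expression, children first."""
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs
    def visit(self, node):
        node = self.generic_visit(node)            # rewrite subtrees first
        bindings = {}
        if isinstance(node, ast.expr) and match(self.lhs, node, bindings):
            return Substitute(bindings).visit(copy.deepcopy(self.rhs))
        return node

def rewrite(source, lhs_text, rhs_text):
    # Patterns are written in surface syntax but parsed into ASTs before use.
    lhs = ast.parse(lhs_text, mode="eval").body
    rhs = ast.parse(rhs_text, mode="eval").body
    tree = ApplyRule(lhs, rhs).visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))

print(rewrite("y = (a + 0) * (b + 0)", "_x + 0", "_x"))
```

The rule `_x + 0 -> _x` is only an algebraic toy (it is not safe for every Python type), but it shows the three pieces: parse the pattern to a tree, match it against program trees while binding variables, and prettyprint the result back to text.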
For your specific examples, the following rules will do the trick:
"... removing a given function":
... adding an if condition around a block of code:
... adding a new function declaration that does nothing:
As you have observed, you have to point these somewhere; that's the job of a "metaprogram", which locates where in your AST you want the rules applied and then applies them. For your rules, you need (with DMS) an explicit procedural method to find the right location. For some DMS rules, you can simply apply them "everywhere"; DMS will essentially walk all over the designated AST and apply the rules for you.
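As a rough illustration of what such a metaprogram does (again in Python's standard `ast` module, not in DMS), the sketch below locates the places to change and applies the three kinds of modification from the question. The class and function names, and the choice to guard only statement-level calls, are assumptions for this example.

```python
import ast

class RemoveFunction(ast.NodeTransformer):
    """Delete every definition of the function called `name`."""
    def __init__(self, name):
        self.name = name
    def visit_FunctionDef(self, node):
        if node.name == self.name:
            return None                      # returning None drops the node
        return self.generic_visit(node)

class GuardCalls(ast.NodeTransformer):
    """Wrap each statement that is a call to `name` in `if False:`."""
    def __init__(self, name):
        self.name = name
    def visit_Expr(self, node):
        call = node.value
        if (isinstance(call, ast.Call)
                and isinstance(call.func, ast.Name)
                and call.func.id == self.name):
            return ast.If(test=ast.Constant(value=False), body=[node], orelse=[])
        return node

def add_empty_function(tree, name):
    """Append a do-nothing function declaration to the module."""
    stub = ast.parse(f"def {name}():\n    pass").body[0]
    tree.body.append(stub)
    return tree

def metaprogram(source, remove, guard, stub):
    # Locate-and-apply: each transformer walks the whole tree ("everywhere"),
    # while add_empty_function targets one explicit location (end of module).
    tree = ast.parse(source)
    tree = RemoveFunction(remove).visit(tree)
    tree = GuardCalls(guard).visit(tree)
    tree = add_empty_function(tree, stub)
    return ast.unparse(ast.fix_missing_locations(tree))

SOURCE = """
def old():
    return 1

def main():
    log('start')
    work()
"""

print(metaprogram(SOURCE, remove="old", guard="log", stub="placeholder"))
```

Note that this sketch does not check that a removed function was the only statement in its enclosing body, and the regenerated text loses comments and original layout; a production transformation system has to deal with both.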
A few rules are never very impressive, in the same way that a few lines of code aren't impressive. A few hundred or thousand rules can do pretty spectacular things (like complete language translations), in the same way that a few hundred or thousand lines of code can produce pretty interesting results. The difference is that conventional code works with numbers, strings and structures, whereas program transformation tools compute over program structures (ASTs).
There's a complete worked example showing how one defines a language and rules to DMS, and how those rules are applied to achieve "program modifications" (the example actually modifies "algebraic expressions", but the ideas are exactly the same).
DMS is unabashedly commercial, and it isn't a dimestore tool, so it might not be what you need for your thesis.
If not DMS, you can get free tools that have the same ideas. Consider TXL (www.txl.ca) or StrategoXt (www.strategoxt.org). DMS, TXL, and Stratego all do program transformations using surface-syntax patterns, but TXL and Stratego can't handle massive changes to code as well as DMS, IMHO. (Read about flow analysis at the DMS website for some reasons.) TXL and Stratego are good for learning the basics and building strong demos, though.