Some of these ideas may sound pretty intimidating, but with some explanation they may not be as hard as they seem. For example, extending Elsa/Oink into a C++ compiler sounds like a lot of work, but even a naive code generator of the kind written in an undergraduate compiler class would be quite helpful to us for testing purposes.
For those who just want to get to know the system, we can always use more test writing, test minimization, code auditing, documentation auditing, etc. There is a lot of C++ out there and not all of it goes through Elsa yet; for each file that doesn't, we need someone to figure out what is going on, minimize the input that causes the bug, and file a bug report. In a Zen temple, the head monk is the one who cleans the toilet.
See also the static-analysis ideas on the Mozilla page that Brendan Eich and I thought up for running Oink on Mozilla.
C++ compiler / Code generator: We already have a C++ front-end; if it had a code-generator, it would be a compiler.
- Even though Oink is primarily targeted as an analysis tool, it would be much easier to test if we had a code generator. That is, tests could consist of batch programs mapping an input to an output; we could compile each with both gcc and the oinkcc compiler and then check that the two binaries have the same input-output behavior.
- The internals of gcc are reputed to be horrible; perhaps we should compete with gcc/g++ and try to replace it.
C++ interpreter: It would be fun to have a source-level debugger / C++ interpreter. That is, parse the program and interpret the abstract syntax tree directly.
- If we had a C++ interpreter, we could build a dynamic (run-time) taint analysis (as Perl has, for example). We could then check the static taint analysis against the dynamic one. The standard implication relating conservative static and dynamic analyses should hold: if a value is tainted at run time, then it should be maybe-tainted statically.
Better linker imitator: We imitate the linker so that we can do whole-program analysis, but we do not fully implement linker functionality.
- (De-facto-)standards-compliant name mangler: While there is perhaps no official standard, there seem to be de-facto standards. Having this would allow the aforesaid compiler / interpreter to link with existing libraries.
Source-to-source transformation: Programmatic AST re-arrangement and pretty-printing for refactoring support or insertion of dynamic analysis.
Note: This has been done by Taras Glek at Mozilla in his project "Pork".
- Merge Pork back into Elsa/Oink.
Better serialization: Though the serialization of the AST is automated in Scott McPeak's astgen language, the serialization of the type-system has too much hand-written code and can get out of sync with the type-system.
- Factoring-out the type-system data: If the type-system data were managed by the astgen language it would greatly improve the ease of serialization of the data-structures not only into the existing XML format but also into ML. An idea of Hendrik Tewes.
Semantic-aware source browser: A version of LXR that could display not only the class hierarchy but also the results of various analyses on the code.
Semantic grep: A regular-expression-like mini-language for matching on parts of subtrees of the AST and possibly the type-system.
- Compiled version: If this could be compiled, analyses could be efficiently written in just this language and hand-coded C++ analyses could be dispensed with.
- Semantic find-replace: semantic grep with capture and replace features, such as Perl's =~ s/(foo)waga/bar$1/.
More static analyses: The main use of Oink is to allow for helpful static analyses of programs.
- Static single assignment analysis: The dataflow analysis would be more precise for stack variables if we had a transformation to static single assignment form.
- Liveness analysis: Find out which expressions might still be used for something.
- Points-to/Alias analysis: The dataflow analysis would become more precise if we had a points-to/alias analysis.
- Unit analysis: I have heard that a study of Department of Defense software concluded that if numbers had units, they would only be "rounds", "rounds per minute", and "megatons". Somehow I think we use more units than that now, and it would be really helpful to check them. This would have saved one of the Mars missions where metric and English units got mixed up. It got to Mars all right... and hit it really hard.
More to come...