Deobfuscating JavaScript through program analysis and machine learning
Software unreliability, piracy and tampering are serious and well-known threats worldwide. Significant attempts have been made to protect software from reverse engineering and tampering. Last week, the Software Reliability Lab at ETH Zurich, headed by Prof. Martin Vechev, released a system called JSNice which makes obfuscated and minified JavaScript code readable again. The system has already been used by more than 34,000 developers in over 140 countries.
Minification and obfuscation are two effective approaches to make code smaller and provide protection. Now, a new technique promises to undo both. Martin, your group has recently released an automatic JavaScript tool that makes obfuscated and minified JavaScript code readable again. How does your system work?
The basic idea behind our system is to learn from the massive amounts of source code that are now available, e.g. from programs found in public code repositories such as GitHub. Our system works in two phases: the learning phase and the query/usage phase. In the first phase, the system learns useful regularities from this large body of source code. These regularities are learned using advanced machine learning techniques over a set of carefully designed features, which are extracted from the source code using program analysis techniques. The outcome of this first phase is a probabilistic model which captures code regularities of interest. In the second phase, given a program, the system automatically infers facts about that program by using the probabilistic model learned in the first phase. For instance, the system can suggest that the given program should use different identifier names or that certain properties of that program are likely to hold.
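The two phases described above can be illustrated with a toy sketch. This is not how JSNice is implemented: plain co-occurrence counting stands in for the probabilistic model, Python is used only for brevity, and the feature strings, the training data and the function names `learn` and `suggest_name` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training data: each entry pairs the context features observed
# for a variable in readable code with the name the developer chose for it.
TRAINING_CORPUS = [
    ({"compared_with:length", "incremented"}, "i"),
    ({"compared_with:length", "incremented"}, "i"),
    ({"compared_with:length", "incremented"}, "index"),
    ({"has_property:length", "indexed_by_variable"}, "items"),
    ({"has_property:length", "indexed_by_variable"}, "list"),
    ({"has_property:length", "indexed_by_variable"}, "items"),
    ({"passed_to_handler", "has_property:target"}, "event"),
]


def learn(corpus):
    """Learning phase: count how often each name co-occurs with each feature."""
    counts = defaultdict(Counter)
    for features, name in corpus:
        for feature in features:
            counts[feature][name] += 1
    return counts


def suggest_name(model, features):
    """Query phase: return the name with the highest total count, or None."""
    scores = Counter()
    for feature in features:
        scores.update(model.get(feature, {}))
    if not scores:
        return None
    # Break ties alphabetically so the result is deterministic.
    return min(scores, key=lambda name: (-scores[name], name))


model = learn(TRAINING_CORPUS)
print(suggest_name(model, {"compared_with:length", "incremented"}))
print(suggest_name(model, {"has_property:length"}))
```

In this sketch, a variable that is incremented and compared against a length is most often called `i` in the training data, so that is the name suggested for a minified variable appearing in the same context. A real system would reason jointly about all identifiers in a program rather than one at a time.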
We instantiated the above approach into a system we call JSNice. JSNice targets JavaScript programs and is able to automatically suggest, with high accuracy, likely identifier names and types for the input program, a task far beyond the reach of any current approach. We released JSNice last week at http://jsnice.org. The system has already had significant impact: in a single week it was used by more than 34,000 unique developers in 141 countries and was featured at various developer web sites including Slashdot, HackerNews and Reddit.
Your tool relates generally to software optimization. Malicious programs, e.g. viruses, worms, Trojan horses, spyware, and other publicly unknown malware, may use obfuscation techniques to hide malicious behavior in an end user's system. As consumers, we are witnessing professional 'hackers' continuously developing new attack techniques. Will the general public benefit from your deobfuscation algorithm?
In general, we believe that our approach has many applications in programming, including de-minification and de-obfuscation of code, debugging, security, probabilistic refactoring, program synthesis and verification. With regard to security, malicious JavaScript code is typically obfuscated and minified, making it difficult for end users to know what kind of program is executing in their web browser. Because JSNice can recover original identifier names with high accuracy, it can help with discovering code that is potentially malicious or unreliable, thus making it more difficult for attackers to hide such code.
Yet despite its critical importance, software remains surprisingly fragile. Prone to unpredictable performance, open to malicious attack and vulnerable to failure at implementation regardless of the most rigorous development processes, in many cases software has been assigned tasks beyond its maturity and reliability. Martin, has ETH already laid the research groundwork for finding solutions to many of these shortcomings?
We believe the direction of leveraging the availability of massive programming data by fusing powerful program analysis and machine learning techniques will lead to fundamentally new program reasoning and construction techniques that can accurately predict and synthesize new programs as well as semantic properties about these programs. In turn, these techniques will lead to next-generation programming environments and reasoning tools, improving overall developer productivity, software reliability and security. JSNice is an example of such a system: it enables applications not previously possible and paves the way towards new hybrid approaches.
Definition:
To deobfuscate is to convert a program that is difficult to understand into one that is simple and understandable. Obfuscation is usually done to secure software from attackers, making it hard for those with malicious intentions to understand its inner functionality. Similarly, obfuscation may also be used to conceal malicious content in software. A deobfuscating tool is used to reverse-engineer such programs.
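As a small illustration of this definition, the sketch below shows the same logic written twice: once with descriptive names and once with the short, meaningless names a minifier might produce. The functions are hypothetical and written in Python for brevity, although JSNice itself targets JavaScript.

```python
def total_price(prices, tax_rate):
    """Readable version: the names describe what each value means."""
    subtotal = sum(prices)
    return subtotal * (1 + tax_rate)


def a(b, c):
    """Same logic after identifier shortening, as a minifier might produce."""
    d = sum(b)
    return d * (1 + c)


# Both versions compute the same result; only the readability differs.
print(total_price([10, 20], 0.5) == a([10, 20], 0.5))
```

Deobfuscation aims to go from the second form back to something like the first, without changing what the program does.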