"I value the freedom of being able to listen to my innate curiosity"
In this series of interviews, we talk to three people who decided to pursue an academic career in computer science and are now working as professors. In the third installment, Ana Klimovic discusses the freedom she enjoys in academia, the value of close interactions with industry for her research and what it takes to transition from a doctorate to being a professor.
This interview is the last installment of a three-part series about people who decided to pursue an academic career in computer science after completing their studies, and are now working as professors.
Part 1: “Luck always plays a role in a research career” (Professor Dennis Hofheinz)
Part 2: "If you survive this, then nothing will stop you" (Professor Niao He)
Part 3: "I value the freedom of being able to listen to my innate curiosity" (Professor Ana Klimovic)
Professor Klimovic, when did you first realise that you wanted to become a scientist?
I considered many different career paths, including being a scientist, a teacher, or a writer. Being a professor turns out to be a combination of all of these. In high school, I was particularly interested in mathematics and physics. I decided to do my Bachelor’s studies in a programme called Engineering Science at the University of Toronto. During my second year in the programme, I took a phenomenal course in Digital Logic and Computer Architecture, which got me fascinated by how computers work. I decided to pursue the Electrical and Computer Engineering stream of my Bachelor’s programme in Toronto and then a Master’s and a doctorate in Electrical Engineering at Stanford University.
Did you have any mentors or role models who supported or even inspired you on your journey?
Absolutely. At the University of Toronto, Professors Jason Anderson, Natalie Enright Jerger, and Jonathan Rose inspired me to pursue studies in computer architecture and become a professor. The summer research internships that I did with Professors Vladimir Stojanovic at MIT and Borivoje Nikolic at UC Berkeley greatly shaped my research interests and inspired me to pursue graduate studies. My doctoral advisor at Stanford University, Christos Kozyrakis, taught me so much about how to find and tackle research problems and inspired many of the ways in which I lead my research group at ETH Zurich today. I am also still in touch with my mentors from industry research internships, such as Eno Thereska who was my mentor at Microsoft Research and taught me a lot about how to write research papers and do large-scale systems research. At ETH, I continue to learn from co-teaching and collaborating on research projects with Professors Gustavo Alonso and Timothy Roscoe, who are both excellent mentors.
You joined ETH Zurich roughly four years ago as an assistant professor. How has your time here been so far?
That’s right, it’ll be four years in September, and it’s been an amazing experience so far: growing a research group from scratch, with a lot of support from the Institute for Computing Platforms and, more generally, from the department. It’s also nice that many other assistant professors joined around the same time, so there’s a real sense of community in a growing, young, very active and dynamic department. It has been a great environment for this journey through the assistant professorship.
What kind of research did you do before joining ETH Zurich?
Before I came to ETH, I did my doctorate at Stanford University in computer systems. I focused on storage systems for cloud computing, and then I spent one year as a research scientist at Google before joining ETH. At Google, I was part of the Google Brain team. That’s where I started getting more and more interested in machine learning applications, their compute and storage requirements, and how to make them run more efficiently from a systems perspective. This time at Google was a good opportunity to broaden my horizons before starting my faculty position.
What advice would you give young researchers looking to become professors?
During your doctoral studies, you naturally become very focused and specialised. But as a professor, you need to make sure you have a broad, big-picture perspective: multiple avenues of research that, of course, relate to each other, but also a good lay of the land that you can use to guide students, steer them towards fruitful directions, and create opportunities for collaboration between them.
Was that breadth part of the reason you chose an academic career?
I think so. I value the freedom of being able to listen to my innate curiosity. I have a lot of freedom to decide which problems I consider most important and relevant to focus on, and how I approach building the solutions. In my field of research, which is very applied, a big question that often comes up is: why not pursue this in industry? Industry offers lots of funding, huge clusters, etcetera. And that is enticing; indeed, a lot of interesting research in cloud computing systems is happening in industry. At the same time, I like the combination of collaborating with industry to understand the trends and problems on the horizon, while working in an academic setting where we have the freedom to approach these problems with potentially clean-slate solutions and are not tied to working on existing systems to ship a product as soon as possible. Rather, we can really think, from a fundamental perspective, about the right way to design these systems.
You just mentioned interaction with industry as an essential element of the kind of research you do. Do you find yourself at a disadvantage in academia because you may lack access to the resources that researchers in industry have? Or is that not a problem at all?
We are able to explore many research questions with the resources we have at ETH and by also using public clouds. We use clusters at ETH for initial experiments on a scale of roughly 20 nodes, and we often use the cloud for experiments where we scale out to more nodes. In addition, we collaborate with industry. For example, students often do internships where they get to run even larger-scale experiments with real workloads.
“As a professor you need to make sure you have a broad, big-picture perspective with multiple avenues of research.” – Professor Ana Klimovic
One of your roles as a professor is to teach. How have you experienced that side of academia so far?
I really enjoy teaching. I find it very rewarding. I’m teaching a Bachelor’s course called "Systems Programming and Computer Architecture", which I co-teach with Professor Timothy Roscoe. I’m also teaching a Master’s course called "Cloud Computing Architecture", a topic that is especially close to my research. It’s interesting to revise the material every year. There’s always new content, and there are relevant announcements from Amazon, NVIDIA and others that come up with new hardware or new cloud services. So, it’s a very dynamic field, and it’s nice to share that with students and have them work on a project where they get hands-on experience building and managing orchestration or scheduling systems in the cloud. The Bachelor’s course is about peeling away the layers of the onion: how the code that you write executes on a processor, what’s happening in memory and in the processor while your program is executing, and how you can leverage this knowledge to design your code to run faster. So, I enjoy interacting with students, thinking about ways of presenting this material, and hearing their questions. This is one of the reasons I like academia: the combination of many different roles you play. There are a lot of different dimensions to it, and that keeps it very dynamic. Every day is different.
One of your main research areas is serverless computing. What is that and why is it important?
The core idea of serverless computing is to raise the level of abstraction for cloud users. In the standard cloud computing paradigm, you would use the cloud by renting a virtual machine. While the cloud provider takes care of maintaining and powering the physical servers, you as the user still need to make a lot of complex resource management decisions. You have this catalog of virtual machines that the provider allows you to pick from, where each machine has a different amount of central processing units (CPUs), memory, storage, accelerators, network bandwidth, etc. So, you need to make decisions about your hardware needs as if you were buying the hardware, except you’re renting it. What an average application developer wants, however, is to focus on their application logic. They don’t want to think about exactly how many CPU cores and what sort of server they should be renting. Serverless computing aims to make it easier to use the cloud, so we don’t all need to be systems experts. Another unique aspect of serverless is the fine granularity of tasks you can run in the cloud. Instead of renting an entire virtual machine, you register fine-grained, short-lived computations and specify when each computation should be invoked (e.g., when data becomes available from a particular source), and you’re charged just for the actual execution time when it gets invoked. The key here is the system infrastructure under the hood, which makes sure these small tasks can be scheduled and spun up quickly. One of our research directions aims to improve the efficiency and performance of these kinds of computing platforms.
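To make the registration and billing contract concrete, here is a toy Python sketch of the serverless model from the user’s perspective. The `ToyServerlessPlatform` class and its API are invented purely for illustration; no real provider works exactly like this.

```python
import time
from typing import Callable, Dict, List

# Toy model of the serverless contract described above: users register
# fine-grained functions against event types, the "platform" invokes them
# on demand, and billing covers only actual execution time. Invented for
# illustration; not any real provider's API.

class ToyServerlessPlatform:
    def __init__(self, price_per_ms: float = 0.000002):
        self.handlers: Dict[str, List[Callable]] = {}
        self.price_per_ms = price_per_ms
        self.billed_ms = 0.0

    def register(self, event_type: str, fn: Callable) -> None:
        # The user supplies only application logic -- no VM size,
        # no core count, no storage tier to choose.
        self.handlers.setdefault(event_type, []).append(fn)

    def emit(self, event_type: str, payload) -> None:
        # Invoke every function registered for this event, metering
        # only the time the user code actually runs.
        for fn in self.handlers.get(event_type, []):
            start = time.perf_counter()
            fn(payload)
            self.billed_ms += (time.perf_counter() - start) * 1000.0

    def invoice(self) -> float:
        return self.billed_ms * self.price_per_ms


platform = ToyServerlessPlatform()
platform.register("object_created",
                  lambda key: print(f"generating thumbnail for {key}"))
platform.emit("object_created", "uploads/cat.png")
print(f"owed so far: ${platform.invoice():.8f}")
```

The point of the sketch is the contract, not the mechanics: the user never sees a machine, and the cost is proportional to work actually done.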
Is that what the Dandelion project is about?
Yes. The idea behind the Dandelion project is to build a new serverless computing platform that provides an intuitive, high-level abstraction to users, while enabling the platform to execute functions more efficiently to support higher throughput per machine. Improving throughput also reduces costs because then you need fewer machines to serve the same load. We’re doing this by rethinking the serverless computing programming model so that we give the platform a bit more information about what part of an application involves computation versus communication. Users build applications by composing pure computation functions (where they can provide arbitrary, untrusted code) and communication functions (which are trusted code implemented by the platform and exposed to users as a library). We then build a system that leverages this programming model to execute functions efficiently. For example, since untrusted code consists of pure functions which compute on declared inputs, Dandelion can prepare the whole memory region and inputs that a function needs before it starts executing, and the untrusted code can execute without support for system calls. This greatly reduces the attack surface of untrusted code compared to traditional serverless computing, allowing us to run functions with more lightweight isolation mechanisms. That lets us spin up functions much more quickly and achieve higher throughput per machine (hence lower cost), all while maintaining strong security guarantees.
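As a rough illustration of that split between untrusted computation and trusted communication, consider the following Python sketch. The composition format and the platform function names here are hypothetical, invented for this article; they are not Dandelion’s actual interface.

```python
# Hypothetical sketch of a Dandelion-style application. Untrusted user
# code is a pure function over declared inputs; all I/O is expressed via
# trusted, platform-provided communication functions.

def word_count(text: bytes) -> bytes:
    # Pure computation: reads only its declared input, produces only its
    # declared output, and needs no system calls while executing.
    counts: dict = {}
    for word in text.decode().split():
        counts[word] = counts.get(word, 0) + 1
    return "\n".join(f"{w} {n}" for w, n in sorted(counts.items())).encode()

# Communication steps are trusted platform functions. Because the platform
# sees exactly which steps fetch or store data, it can stage all inputs in
# memory before the untrusted function starts running.
composition = [
    ("fetch",   "platform.http_get", {"url": "https://example.com/doc.txt"}),
    ("compute", word_count,          {"text": "@fetch.output"}),
    ("store",   "platform.blob_put", {"key": "results/wordcount.txt",
                                      "data": "@compute.output"}),
]
```

The design choice this illustrates is that by the time `word_count` runs, everything it needs is already in its memory region, so it can be sandboxed far more cheaply than a function that might make arbitrary system calls.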
“This is one of the reasons I like academia, the combination of many different roles you play. Every day is different.” – Professor Ana Klimovic
Another topic you work on is hardware or platform optimization for machine learning applications. What is the challenge you want to tackle there?
Broadly, my group works on computer systems for cloud computing, and a major application domain running in the cloud is machine learning. Machine learning is very energy-intensive: it requires lots of graphics processing units (GPUs), lots of compute hours, high network bandwidth, storage, etcetera. It’s extremely computationally expensive and therefore important to optimise. We’re interested in making machine learning less expensive (i.e., more resource-efficient), so that it becomes practical to apply ML in even more settings. We approach this from a systems perspective. We look at how to get the most out of the GPUs that you’re renting from the cloud. Even for a machine learning training workload, which is very compute- and memory-intensive, the utilization of your GPU can be quite low while the job is running. There are several reasons for that. You can have bottlenecks in other parts of the system that are not on the GPU, for example in the preprocessing of the data on the CPU. This increases the cost of running your workload because your job takes longer, so you have to rent these GPUs for longer.
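One standard way to attack that CPU-side preprocessing bottleneck is to overlap input preparation with GPU compute. A minimal PyTorch sketch of that idea might look like this, with a synthetic dataset standing in for real data:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class SyntheticImages(Dataset):
    """Stand-in dataset: random tensors take the place of decoded images."""
    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        x = torch.randn(3, 224, 224)            # pretend "decode"
        x = (x - x.mean()) / (x.std() + 1e-6)   # pretend CPU preprocessing
        return x, idx % 10

if __name__ == "__main__":
    loader = DataLoader(
        SyntheticImages(),
        batch_size=64,
        num_workers=4,       # preprocess batches in parallel on the CPU
        pin_memory=True,     # enables faster, asynchronous host-to-GPU copies
        prefetch_factor=2,   # each worker keeps batches ready ahead of time
    )
    device = "cuda" if torch.cuda.is_available() else "cpu"
    for x, y in loader:
        x = x.to(device, non_blocking=True)  # copy can overlap with compute
        # ... forward/backward pass would run here ...
        break
```

If the workers cannot keep up, the GPU still stalls waiting for input, which is exactly the kind of imbalance her group’s systems work targets.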
Another bottleneck could be the synchronisation of tasks and the exchange of data between GPUs. When you zoom into how a single machine learning job works, you see very spiky GPU utilization patterns with high-usage moments, but on average the utilization is only 50% or lower. So, the question becomes: how do you improve this? You need to figure out a good co-location strategy and profile these applications to understand which parts are compute-intensive versus memory-intensive, and which are good candidates to co-locate and share a GPU. This is one example of our ongoing line of work: how to schedule jobs efficiently on GPUs so that we get high resource utilization.
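In that spirit, a co-location decision can be as simple as pairing a compute-bound job with a memory-bound one whenever their combined demands are predicted to fit on one GPU. The profiling numbers and the capacity threshold in this sketch are invented for illustration:

```python
# Toy co-location heuristic: pair jobs whose combined GPU compute (SM)
# and memory-bandwidth demands both fit under capacity. All numbers are
# invented for illustration.

jobs = [
    {"name": "resnet_train", "sm_util": 0.85, "mem_bw_util": 0.30},
    {"name": "embed_lookup", "sm_util": 0.20, "mem_bw_util": 0.75},
    {"name": "llm_finetune", "sm_util": 0.90, "mem_bw_util": 0.60},
    {"name": "data_augment", "sm_util": 0.15, "mem_bw_util": 0.25},
]

def can_share(a, b, cap=1.0):
    """Co-locate only if both resource dimensions fit on one GPU."""
    return (a["sm_util"] + b["sm_util"] <= cap and
            a["mem_bw_util"] + b["mem_bw_util"] <= cap)

# Greedy pairing: compute-bound jobs make good partners for memory-bound ones.
placed, pairs = set(), []
for i, a in enumerate(jobs):
    if a["name"] in placed:
        continue
    partner = next((b for b in jobs[i + 1:]
                    if b["name"] not in placed and can_share(a, b)), None)
    if partner:
        pairs.append((a["name"], partner["name"]))
        placed.update({a["name"], partner["name"]})

print(pairs)  # e.g., [('resnet_train', 'data_augment')]
```

Real schedulers must additionally account for interference between co-located jobs, but the sketch captures the core profiling-then-packing idea.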
Other directions we are exploring include system infrastructure to efficiently manage which data gets used for training ML models, particularly when you need to periodically re-train models with new data coming in. We are also exploring how to efficiently serve large-scale models (e.g., large language models), especially when you have multiple variations of large models that are customized for different use cases, and you need to serve requests to many model variants concurrently.
You’re also involved in the Swiss AI Initiative. What is your role there?
The goal of the Swiss AI Initiative is to tap into the potential of foundation models to benefit society. Due to the computational resources required to train these models, this has until now remained mostly the realm of big companies. But it’s important for technology like this to be explored by academics, with the social-good aspects at the forefront. So, the goal is to leverage an immense computing facility, the CSCS Alps cluster, and join forces across ETH, EPFL, and other institutions to train an open-source model with open-source data and apply it to different application domains, such as medical applications, education, and fundamental science research. I’m on the steering committee that organises the initiative, and I’m a co-lead of its infrastructure aspects, which focus on the system software and hardware interaction needed to get good efficiency and performance when running AI jobs. It’s an exciting, large-scale initiative with huge potential for real societal impact. It’s still in the early stages, and I’m looking forward to seeing it develop in the coming years.
Ana Klimovic is an Assistant Professor in the Department of Computer Science at ETH Zurich where she leads the Efficient Architectures and Systems Lab (EASL) within the Institute for Computing Platforms – Systems Group.
She works on computer systems for large-scale applications such as cloud computing services, data analytics, and machine learning. The goal of her research is to improve the performance and resource efficiency of cloud computing while making it easier for users to deploy and manage their applications. Her research interests span operating systems, computer architecture, and their intersection with machine learning.
Before joining ETH, she spent a year as a Research Scientist at Google Brain. She completed her Ph.D. in Electrical Engineering at Stanford University, advised by Professor Christos Kozyrakis; her dissertation was on the design and implementation of fast, elastic storage for cloud computing. She earned her Master’s degree in Electrical Engineering at Stanford University in 2015. Before that, she graduated from the Engineering Science programme at the University of Toronto in 2013 with a Bachelor’s degree in Applied Science and Engineering.
Further information
- Homepage of Ana Klimovic and the Efficient Architectures and Systems Lab (EASL)
- Institute for Computing Platforms – Systems Group
- ETH AI Center
- Swiss AI Initiative
- Stanford University
- University of Toronto