This is the eleventh in a series of interviews with members of the Machine Commons supplier Collective. Subscribe to the site to be alerted about future posts, or become a partner today!
Steve Purves is the co-founder of Curvenote, a collaborative writing platform for scientists and engineers, based in Alberta. Their mission is to bring the scientific method into the 21st century. In short, they're bringing dynamic processes from the software engineering field - such as collaboration, live links and version control - into a platform that's as easy to use as Google Docs.
Because scientists shouldn't need the skills of a data scientist just to work with data.
How have you found business during the pandemic?
“The pandemic, I think, has created a huge case for a change in direction. Back at the start of last year, I was working in a startup for the oil and gas industry."
"That’s an industry I’d previously left and had no intention of going back to, but was lured back by the prospect of being able to apply the new generation of machine learning tools to 3D seismic data."
Classic, lured by the opportunity to apply new tools to interesting data!
“Then, due to the pandemic, the oil industry hit crisis mode. There was a price crash, the pandemic… it knocked the industry out for a while, more severely than its usual cyclicality, and belts were tightened - especially in R&D.”
“The pandemic really gives you a better understanding of how you’re spending your time."
"It puts things in perspective, for us and for a lot of people. I didn’t want to work in the oil industry anymore, except perhaps to help with the clean energy transition.”
“Then I met my co-founder Rowan, and we found that we had this shared mission: to create positive change, to change science - to change how it was done."
"To make it more effective with tools that were built for scientists today and support a modern scientific collaboration process.”
What do you mean, what’s wrong with the current tools available to the scientific community?
“Currently many researchers use a mish-mash of available tech – like Dropbox, Microsoft Word, Google Docs, LaTeX, git – managing to do all their work with tools that weren’t really made for them. As more scientists’ work involves data, increasingly specialized data analysis tools add to the complication.”
“What we are doing is trying to get a modern collaboration process going.”
“All science is basically turning into data science."
"Perhaps not in the ‘big data’/’machine learning’ sense of the word; not data science as traditional ‘tech’ businesses would see but, increasingly: scientists, engineers, and other technical professionals, must deal with data throughout traditional industries and the sciences. Those professionals are getting into Python, R, Julia and using those to process and analyze their data, to present their work.”
“I think things are getting more challenging, they’re getting even more computational.”
So, business has been interesting then?!
“The tech sector has been booming, especially around collaboration and communication – anything that supports remote working. I’ve been working remotely for over 8 years now, but many people have now been pushed into that lifestyle and way of working. It highlighted the need for better tools for science specifically and, in the face of a pandemic, the importance of good science that can progress rapidly.”
“So yes, business is good for us!”
“We got accepted into YCombinator back in December 2020 and went through their W21 program. They’ve been really pushing us to look at how to grow the business – defining future growth and what the business looks like 1, 5, 10 years down the line. That’s given us a real focus on our core mission: providing these tools for people working in research, science, and engineering - data scientists and the people they need to collaborate with.”
“We’re quite a ‘deep tech’ focussed company even though we’re software-based: they’ve really tuned in to long-term potential.”
Tell me what you do, what exactly is Curvenote?
“It’s a collaborative writing platform for scientists and engineers."
"The aim is to get a GoogleDocs like experience around writing reports, papers, discussion documents, for any technical writing."
"Good for collaborative editing, threaded discussions; designed for ease.”
“But also with all the precision you get from professional writing software like LaTeX -- full maths support, citations, references, figures, cross-referencing -- in a WYSIWYG document like a Google Doc, where you don’t need to write LaTeX markup and five of you can easily work on it at the same time.”
“Curvenote also integrates with Jupyter, the predominant open-source data science notebook platform, and does some pretty powerful things. It adds version control, publishing and real-time commenting directly in Jupyter, and works whether that’s running locally, on JupyterHub, Amazon SageMaker, Coiled (a Dask-powered notebook in the cloud), Saturn Cloud and more - basically anywhere there’s an embedded Jupyter instance. What this means is that a data scientist on a team can work directly between Jupyter notebooks and written documents and reports, embedding results and visualisations and keeping them up to date in a couple of clicks.”
“Some of our users are geoscientists and geotechnical engineers, some are in bioinformatics and others in data science.”
“So, you’ve run your notebook, generated results. You think: how do I get this to the rest of my team? Or how do I get all these graphs out into a report? Once saved in Curvenote, all of the notebook cells and outputs can be used in reports and documents that others can contribute to, review and consume. This enables domain experts, non-coding team members, and other stakeholders like sales groups, or management to collaborate and review output from data science work in a much more fluid way."
"Feedback and iteration cycles get much shorter, and as there are no screenshots or copying and pasting graphs and data, everything remains linked and traceable, and reproducible right through to final hardcopy output if you need it.”
He says that if results and visualizations are ever updated (as happens regularly), they can easily be updated wherever they are used.
“We’re getting rid of the copy-paste step, where scientists copy graphs into Word or send images by email or Slack. Now they can directly pull it out of versioned Jupyter notebooks.”
“Then every graph links back to where that information is from. The precise version of the notebook, updated over time, the code used for that run, the dataset that was loaded, and so on. When a report is published people can be confident about which version of the data analysis was actually shipped.”
“So, you end up creating this connected web of pieces of work that are immutable and trackable. Which is a big step for being able to audit these work processes.”
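In spirit, this is the same trick Git uses: hash-linked, immutable records. Here's a minimal sketch of that general technique - my own illustration, assuming nothing about Curvenote's actual data model, with hypothetical URIs and names throughout:

```python
# Minimal sketch of hash-linked provenance records, in the spirit of Git.
# Illustrative only - this is not Curvenote's actual data model.
import hashlib
import json

def record(payload, parent=None):
    """Create an immutable record whose ID commits to its content and parent."""
    body = {"payload": payload, "parent": parent}
    rid = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return rid, body

# A dataset version, the notebook run that consumed it, the figure it produced
ds_id, _ = record({"kind": "dataset", "uri": "s3://bucket/survey.csv"})  # hypothetical URI
nb_id, _ = record({"kind": "notebook", "cell": "plot_results"}, parent=ds_id)
fig_id, _ = record({"kind": "figure", "caption": "Figure 1"}, parent=nb_id)

# Changing anything upstream changes every downstream ID, so a published
# figure ID pins the exact notebook version and dataset behind it.
print(fig_id)
```

Because each ID commits to everything upstream, a report that embeds a figure ID is auditable back to the data by construction.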
So what? What’s the wider shift - how does it ‘improve science’?
“For a start, one thing is productivity; you save a bunch of time.”
“Take feedback. You’re no longer trying to integrate comments arriving in three different forms. Typically, you might get email back, comments in Google Docs or wherever; someone might print your article and scan it back as a PDF with handwritten notes. And you’re trying to integrate all this stuff into your work - your paper or report.”
I sort of knew this about research – I once had a professor whose desk was literally a mountain of paperwork. When I asked how he kept track of it, he responded “it grows like a tree” – “2011, 2012, 2013” – gesturing down the layers of his tree. Academics can be so analog!
But when he says it out loud, I immediately recognise how daft the sheer lack of technical structure in research is. There’s no set way of doing it.
He continues to outline the current workflow for academics and researchers that he seeks to replace.
“Our system allows you to get your comments back directly.”
“The other factor is that science still relies heavily on PDF - generating static reports. That’s something we want to change: encouraging scientists to publish their reports and their work in a richer, more interactive online form.”
“People can follow links in your document and interact with live visualisations. They can use parts of your research in their own work much more easily.”
“If someone wants to reference a particular figure, you currently have to reference the whole paper and specify the figure in the references section. You can now just reference the figure directly - a much easier and more granular way of referencing previous work.”
“Third is that this is a big step towards reproducibility - if your papers are created with links back to the computational notebooks, environments, and datasets used in the research, it is far easier for others to reproduce and validate the work and to build upon it. This cycle of reproduction and extension is fundamental to the scientific process yet paper-based publications (which PDFs essentially are) keep us back in the dark ages.”
Reproducibility is a major problem in science, isn’t it?
“Some crazy number – for something like 60% of papers – researchers have tried to reproduce the results but couldn’t.”
“Paperswithoutcode.com was a website set up to list papers whose results couldn’t be reproduced. It’s clear something needs to be done about them.”
“I think this is part of the problem: producing those reproducible papers can take a huge amount of extra effort and requires an additional skills base that many scientists just don’t have."
"Those who manage it do some great work and some impressive DevOps-style engineering around their research to make it possible. But that means only a small percentage can actually achieve this goal.”
“The people who do manage to publish their code and figures alongside their research have figured it all out themselves - a huge learning curve that only a small percentage get over.”
”What we want to do with Curvenote is make this much, much more accessible, so you can work in this way by default and Curvenote takes care of the heavy lifting for you. From your day-to-day work to online interactive preprints and journal-ready publication copy without having to build your own website or become a LaTeX guru.”
Is this the case with all research?
“Take data science research - say someone creating a new ImageNet-style dataset for neural network architecture development. Imagine that the researcher is very organized.”
“They’re very disciplined: the code and the collated images are all stored at different URLs, so anyone can run a script and just reproduce those same results.”
“When it comes time to publish their paper, they just make their repository available, because that’s how they work. It’s all in sync.”
“These people are at the high end of the spectrum, they already have a lot of the DevOps style skills required.”
Right, any changes automatically update in the repository and the most recent version is always available to anyone who wants to come along and validate those results.
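As a sketch of what that “just run a script” experience amounts to - the manifest format, URLs, and file names here are hypothetical stand-ins, not taken from any real project:

```python
# Hypothetical reproduction script: fetch a dataset manifest and verify it.
# Assumes a CSV manifest of "url,sha256" rows published alongside the paper.
import csv
import hashlib
import urllib.request
from pathlib import Path

MANIFEST = "manifest.csv"  # hypothetical: one file per row
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

with open(MANIFEST, newline="") as f:
    for row in csv.DictReader(f):
        target = DATA_DIR / Path(row["url"]).name
        if not target.exists():
            urllib.request.urlretrieve(row["url"], target)  # fetch the file
        digest = hashlib.sha256(target.read_bytes()).hexdigest()
        # Fail loudly if the download doesn't match the published checksum
        assert digest == row["sha256"], f"checksum mismatch: {target}"

print("Dataset reproduced and verified.")
```

The organized researcher maintains something like this as a matter of course; the point that follows is that most researchers don't.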
“Compare that to another researcher. They haven’t learned how to make their model weights available, how to set up a working environment, or how to write a fetching script. They haven’t learned that, so all of a sudden it’s an extra two days of work.”
“For researchers outside of the data science or machine learning field, it’s an even bigger ask because the skills overlap isn’t there.”
“We just want to make it easier to work in a reproducible way day-to-day, in any branch of science. The tools must apply to day-to-day work and collaboration.”
He’s saying scientists have to learn web development skills to publish research with data.
“Yes, when researchers set this up for themselves, they’ve effectively learned to become a DevOps person: they’ve had to learn how to get their data onto Amazon S3 or Zenodo, fetch it over HTTP, use Git for versioning, use continuous integration tools for aggregating and building documents or a static website…”
“A researcher has had to learn software engineering and DevOps tools.”
“But 99% of scientists haven’t done that. And should they? They’re useful skills, but ones that go out of date fast too - should they have to learn all this?! They should just be able to move their data around easily without having to be a DevOps person.”
That makes sense - why should a scientist have to be a software developer just to work with their data?
“We’re providing those sorts of features that are readily available to software and DevOps people, like version control and package management, but in an easy-to-use way.”
Why are DevOps skills/tools so useful to a scientist?
“If you’re writing code in Python, there are hundreds of thousands of libraries. Any developer can easily import one directly into what they’re working on.”
“But that doesn’t happen in science, and yet publication is supposed to be all about providing information on new advances so that we can build the next incremental step. So, if we could provide a sort of package manager – where someone could just check out the last paper, grab an equation, and put it in their paper, just like you would with a package manager – that would have a huge impact on science.”
“Imagine how many times researchers have typed out the mathematics behind the back-propagation algorithm."
"Imagine if they could just pull it from the last paper they cited, or the original that everyone cites. Imagine if when that paper was updated, with errata or new notation, everything downstream could be updated. That would cause a big change.”
“If we can do that - if we start taking science out of a PDF and putting it in a format that’s more like the internet - it would accelerate science.”
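For a concrete sense of what gets retyped over and over, these are the standard backpropagation equations he's alluding to, in one common notation, as they'd appear in a paper's LaTeX source:

```latex
% The backpropagation equations countless papers retype by hand:
% output-layer error, backpropagated error, and the weight/bias gradients.
\begin{align}
  \delta^{(L)} &= \nabla_a C \odot \sigma'\bigl(z^{(L)}\bigr) \\
  \delta^{(l)} &= \bigl(W^{(l+1)}\bigr)^\top \delta^{(l+1)} \odot \sigma'\bigl(z^{(l)}\bigr) \\
  \frac{\partial C}{\partial W^{(l)}} &= \delta^{(l)} \bigl(a^{(l-1)}\bigr)^\top,
  \qquad
  \frac{\partial C}{\partial b^{(l)}} = \delta^{(l)}
\end{align}
```

In the world he's describing, this block would be pulled in by reference from the canonical source rather than transcribed, and corrections would propagate downstream.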
It’s a wonderful idea, but it strikes me as needing to be used in order to be useful – a little ‘chicken or the egg’.
Will you be importing published PDFs so there’s an existing body of research in the system?
“We’re focused on creating new research on the platform, as opposed to bulk importing. We’re trying to help people change their day-to-day workflows. Of course, they can import their work and build on it in Curvenote, but this body of work will build up over time as people use the platform.”
“Also, a lot of existing research is not publicly available. Research that is publicly funded ends up behind a paywall, with publishers holding the rights to it. A researcher or research group would have to pay thousands to the journal to provide an open-access version of that work.”
“Here I am publishing work - it’s my work, funded by someone else - I’m giving it to a publisher to share, and now I have to pay the publisher to make it free to read. This is changing through more open-access journals appearing and through legislative pressure, but change will take time.”
That does sound ridiculous.
Publishers have the institutions doing the research completely under their control and paying them for the privilege.
“I think it’s hard to see it as a sustainable model, especially as the pace of research continues to increase.”
“Do you know arXiv.org?” he asks.
"It’s where many computer science papers are published. It’s a preprint service, and there are many others serving different scientific fields. After a paper is written, the peer review process at a traditional journal can take anywhere from 3-6 months before it can go forward to publication. Reviewers tend to work on a volunteer basis, and with publication volume increasing, that system is already stressed. It’s difficult to see how you can speed it up within the current model, but clearly this is too slow.”
“People are publishing on preprint services as soon as they’re ready - months ahead of time. The work hasn’t been peer-reviewed and the results haven’t been validated by that process, but their availability means others can start trying to reproduce the work much sooner, which I think is a level of validation that goes beyond peer review. There are tens of thousands of papers per month published on arXiv and other services, and it’s growing.”
No, I haven’t heard of arXiv.org. It’s completely free?!
He shares his screen, which shows a graph of arXiv.org monthly submissions. It’s basically exponential.
Publishing institutions lagged behind innovation and this free hosting site has filled a vacuum.
“All machine learning and computer science-related material is in here.”
Let’s not forget that ‘pre-published’ means ‘not published’.
How do you know what you can trust?
“You have to know what you can trust: who produced the research, what lab, who the supervisor was. I guess you can do a quality check based on reputation.”
“At the end of the day, this is where reproducibility comes in. If it was possible to also reproduce that paper independently at any time, easily by running a script, then that sort of makes the peer review process deterministic in a way. The graphs are right there. All that’s left is interpretation.”
“You often see papers that take 5-10 other papers and summarise them. If everything were faster with digital tools, the pace of science could just change - the scientific workflow could be much faster.”
If it’s quicker to reproduce, it’s quicker to pick it up and take a step forward. Publish that. The virtuous cycle continues.
“At the moment, this cycle takes a huge amount of time.”
He repeats that work published this way is searchable, unlike a PDF.
“There are these huge barriers to scientific progress due to the wrong tech being used. arXiv removed just one element – the peer review stage – and this has allowed people to publish their papers earlier, albeit with caveats.”
“Computer science and ML have probably already accelerated as a result of arXiv. It’s a field where, a lot of the time, researchers can validate the results themselves.”
I love the idea that removing a point of friction – the peer review process – is accelerating machine learning. I suppose it comes with workarounds.
“Workarounds. Maybe they’re not workarounds, maybe they’re actually better alternatives.”
“When journal reviewers have a mountain of paper publications to get through, perhaps peer review isn’t better, perhaps distributed validation of results is, in fact, a better way of doing it."
"It’s like crowdsourcing it among the scientific community that’s trying to progress in the field. It becomes a much more open and collaborative model, which is maybe closer to one we should be embracing.”
Ok, so research gets ‘crowdsourced’. Then what? Your mission is complete. What does the world look like?
“That world looks like somebody being able to search through scientific ideas as easily as you would search Wikipedia today. Instead of hitting a wall – you can’t access this – or hitting a huge wall of text that is difficult to reach into. Much more discoverable.”
“At the moment, a researcher uses a fantastic amount of computing power to run their simulation, such as to look at a new material or its properties in a new way; but, in a scientific paper, they only have room for one or two static figures from that environment."
"Imagine being able to publish an interactive thing for people to explore those results!”
“You don’t just get the one answer – the optimal configuration of the material – but instead a slider to show different aspects. Previously, you’d have to go back and re-run the whole simulation. With modern visualization tools, we could let people explore ‘what if’ scenarios themselves, with the data produced by the scientists. That would be completely different from reading a static paper.”
Interesting. So tell me, how do you plan to get there? It seems like a long play before you can start delivering these changes, doesn’t it?
“Yeah, there are some far-reaching ideas here and it’s this type of change in science that we are excited about helping happen."
"What has really struck us is how pervasive some of the problems are. Not just in general scientific research but right across scientific disciplines from academia to research labs, and within many many businesses.”
“This disconnect exists pretty much everywhere. Whether it’s a graduate student collaborating or communicating with their peers and profs in the lab, or it’s a data science team needing to get information into the hands of management or sales teams. The disconnect means lots of manual work, but also a break in the audit trail. There is room for error there, uncertainty about the provenance of that information.”
“This is a bad situation to be in from a data management point of view but happens regularly because groups of people tend to be pulled into silos by the tools they use and those silos create these disconnects.”
“Say you have a data analytics, or data science or some form of data team working with a toolchain, say their data science platform is Jupyter...”
”...They need to get data analysis results out to customer account teams in a form the customer can consume directly. They need to collaborate with the account teams to do that, but those reports get written in Word and delivered as PDF, and the account team can’t write Markdown or deal with Jupyter - and there is the disconnect.”
"Curvenote maintains provenance and enables collaboration across those diverse groups.”
“So we see this being used as much in businesses with that need to integrate technical processes into business processes, as in academia. We’re exploring use cases there at the moment with geotechnical engineering and biotech companies that are using Curvenote for that collaboration right now.”
“So that is how we’ll get there. There are so many interesting problems the platform can solve.”
Interesting. Tell me about machine readability. Do you foresee more automation in scientific discoveries?
“It would enable automation for sure, because it would take one level of parsing out.”
“If someone today wanted to do that - wanted to create an API over one specific scientific field - they’d have to take all the PDFs, run text extraction, recognise the different pieces of content, figure out where the same things appear across multiple documents, build those relationships themselves, and then start to use that to understand how the work is connected.”
“Making it machine-readable in the first place, at source, removes a big part of that step.”
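A rough sketch of the kind of pipeline he's describing, using `pdfminer.six` for text extraction and `networkx` for the relationship graph - the folder name and the naive citation pattern are my own stand-ins:

```python
# Rough sketch of mining citation links out of a folder of PDFs.
# Real pipelines need far more robust reference parsing than this regex.
import re
from pathlib import Path

import networkx as nx                         # pip install networkx
from pdfminer.high_level import extract_text  # pip install pdfminer.six

# Naive pattern for inline citations like "(Smith et al., 2019)"
CITATION = re.compile(r"\(([A-Z][A-Za-z-]+) et al\., (\d{4})\)")

graph = nx.DiGraph()
for pdf in Path("papers").glob("*.pdf"):  # hypothetical folder of papers
    text = extract_text(pdf)              # lossy: layout and math get mangled
    for author, year in CITATION.findall(text):
        # Edge: this paper mentions (cites) the (author, year) work
        graph.add_edge(pdf.stem, f"{author} {year}")

print(f"{graph.number_of_nodes()} works, {graph.number_of_edges()} citation links")
```

The fragile step is the extraction itself; publishing machine-readable content at source removes it entirely.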
“There are some interesting things about the ‘knowledge graph’ aspect of organizing connected information in a database that will be exciting to explore."
"With enough data maybe you see where scientific discoveries were being made – and perhaps likely to be made next?”
Wow. That’s a thought. Imagine a prediction algorithm with scientific premonition. A degree of certainty about where the next scientific breakthrough was likely to be made, guiding the focus of scientists. We’re in for a bizarre century.
Curvenote is free to use for individual researchers, with additional premium features available on paid plans. It’s publicly available and you can sign up on their website http://curvenote.com. Curvenote is also running pilot projects with businesses that want to see how Curvenote can fit their particular needs, from communicating across different disciplines to delivering reports to end customers. You can reach the founding team on founders@curvenote.com to start a conversation.