I spent some of this afternoon watching the Mozfest 2021 panel session entitled “AI and the Environment: Friend or Foe?”, which was super interesting. One of the panelists was Khari Johnson, a writer for VentureBeat who’s been covering the relationship between ML and the environment for some time. One thing that particularly caught my attention was this quote from Andrew Ng at NeurIPS 2019, which quite explicitly makes the link between scale of data collection, surveillance capitalism and sustainability:
Andrew Ng, along with other panelists, called for progress on ML that works with small data sets and applications like self-supervised learning in tandem with transfer learning so that training models requires less data. “A lot of machine learning, modern deep learning, has grown up in the large consumer internet companies, which have billions or hundreds of millions of users, and have large data sets, in the climate change setting when we look inside their imagery,” he said. “Sometimes we have only hundreds or maybe thousands of pictures of wind turbines or whatever… [with] these very small data sets, I find that you need new techniques in order to address them, and [what] I see broadly is that for machine learning to break into other disciplines outside software [and] internet [companies], we need better techniques to deal with small-data or low-data regimes.”
Compare and contrast with this from François Chollet a few weeks ago on Twitter:
An under-appreciated feature of our present is how we record almost everything — far more data than we can analyze. Future historians will be able to reconstruct and understand our time far better than we perceive and understand it right now.
— François Chollet (@fchollet) February 14, 2021
It’s interesting to see the tension between these two attitudes, both coming from within the ML research community. On the one hand, there is a recognition that ‘smaller scale’ ML methods will be both more sustainable and more useful, since they can be applied outside the rarefied environments of the big tech companies. On the other, there is a commitment to the idea that ‘more data always == better’, and therefore to Diogenes syndrome at an institutional level. This also nicely links two of my ideas: that learning to throw away data is a sustainability issue (considering not just the costs of storing it, but also the costs of the potential future uses, such as training ML models, that storing it affords), and that it is also an epistemic issue, in that what is left out of an archive of information helps define the meaning of the archive as a whole.