Microsoft is launching a research project to estimate the influence of specific training examples on the text, images, and other types of media that generative AI models produce.
That’s according to a job listing dating back to December that was recently recirculated on LinkedIn.
According to the listing, which seeks a research intern, the project will attempt to demonstrate that models can be trained in such a way that the impact of particular data (e.g., photos and books) on their outputs can be “efficiently and usefully estimated.”
“Current neural network architectures are opaque in terms of providing sources for their generations, and there are good reasons to change this,” the listing reads. “[One is] incentives, recognition, and potentially pay for people who contribute certain valuable data to unforeseen kinds of models we will want to exist in the future.”
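The listing doesn’t explain how such estimates would be computed, but training-data attribution is an active research area. One well-known family of techniques, exemplified by TracIn, scores a training example by how strongly its loss gradient aligns with the loss gradient of a given test output. The sketch below illustrates that general idea only, not Microsoft’s method; the toy model and data are hypothetical.

```python
import torch

def loss_gradient(model, loss_fn, x, y):
    """Flattened gradient of the loss at (x, y) with respect to model parameters."""
    loss = loss_fn(model(x), y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_scores(model, loss_fn, train_set, test_example):
    """TracIn-style attribution: score each training example by the dot
    product of its loss gradient with the test example's loss gradient."""
    test_grad = loss_gradient(model, loss_fn, *test_example)
    return [torch.dot(loss_gradient(model, loss_fn, x, y), test_grad).item()
            for x, y in train_set]

# Toy demo: a linear regressor and three synthetic training points.
model = torch.nn.Linear(2, 1)
loss_fn = torch.nn.MSELoss()
train_set = [(torch.randn(1, 2), torch.randn(1, 1)) for _ in range(3)]
test_example = (torch.randn(1, 2), torch.randn(1, 1))
print(influence_scores(model, loss_fn, train_set, test_example))
```

Running this exactly for a frontier model would be far too expensive, which is why the listing’s goal of “efficiently” estimating influence is the hard part; published methods lean on approximations such as sampled checkpoints or projected gradients.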
AI-powered text, code, image, video, and song generators are at the center of a number of IP lawsuits against AI companies. Frequently, these companies train their models on massive amounts of data scraped from public websites, some of which is copyrighted. Many of the companies argue that fair use doctrine shields their data-scraping and training practices. But creatives, from artists to programmers to authors, largely disagree.
Microsoft itself faces at least two legal challenges from copyright owners.
The New York Times sued the tech giant and its collaborator OpenAI in December, accusing the two of infringing on The Times’ copyright by deploying models trained on millions of its articles. Several software developers have also filed suit against Microsoft, claiming that the company’s GitHub Copilot AI coding assistant was unlawfully trained using their protected works.
Microsoft’s new research effort, which the listing describes as “training-time provenance,” reportedly involves Jaron Lanier, the accomplished technologist and interdisciplinary scientist at Microsoft Research. In an April 2023 op-ed in The New Yorker, Lanier wrote about the concept of “data dignity,” which to him meant connecting “digital stuff” with “the humans who want to be known for having made it.”
“A data-dignity approach would trace the most unique and influential contributors when a big model provides a valuable output,” Lanier wrote. “For instance, if you ask a model for ‘an animated movie of my kids in an oil-painting world of talking cats on an adventure,’ then certain key oil painters, cat portraitists, voice actors, and writers (or their estates) might be calculated to have been uniquely essential to the creation of the new masterpiece.”
Notably, several companies are already attempting this. Bria, an AI model developer that recently raised $40 million in venture capital, claims to “programmatically” compensate data owners according to their “overall influence.” Adobe and Shutterstock also award regular payouts to dataset contributors, though the exact payout amounts tend to be opaque.
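Bria’s claim points at the arithmetic any such scheme eventually needs: turning per-contributor influence scores into proportional payout shares. Here is a minimal, purely hypothetical sketch; the contributor names, scores, and revenue pool are invented for illustration.

```python
def payout_shares(influence_by_contributor: dict[str, float], pool: float) -> dict[str, float]:
    """Split a revenue pool in proportion to non-negative influence scores."""
    positive = {name: max(score, 0.0) for name, score in influence_by_contributor.items()}
    total = sum(positive.values())
    if total == 0:
        return {name: 0.0 for name in positive}
    return {name: pool * score / total for name, score in positive.items()}

# Example: three contributors to Lanier's hypothetical "masterpiece" and a $100 pool.
print(payout_shares({"oil_painter": 0.6, "cat_portraitist": 0.3, "writer": 0.1}, 100.0))
# -> {'oil_painter': 60.0, 'cat_portraitist': 30.0, 'writer': 10.0}
```

The hard part, as the opaque payout amounts at Adobe and Shutterstock suggest, is not the split itself but producing influence scores that contributors trust.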
Few large labs have established individual contributor payout programs beyond inking licensing agreements with publishers, platforms, and data brokers. Instead, they’ve provided means for copyright holders to “opt out” of training. But some of these opt-out processes are onerous, and they apply only to future models, not previously trained ones.
Of course, Microsoft’s project may amount to nothing more than a proof of concept. There’s precedent for that. Back in May, OpenAI said it was developing similar technology to let creators specify how they want their works to be included in, or excluded from, training data. But nearly a year later, the tool has yet to see the light of day, and it reportedly hasn’t been viewed as a priority internally.
Microsoft may also be trying to “ethics wash” here, or to head off regulations and/or court decisions disruptive to its AI business.
Still, the fact that the company is investigating ways to trace training data is notable in light of other AI labs’ recently expressed stances on fair use. Several of the top labs, including Google and OpenAI, have published policy documents recommending that the Trump administration weaken copyright protections as they relate to AI development. OpenAI has explicitly called on the US government to codify fair use for model training.
Microsoft did not immediately respond to requests for comment.