This is an important article if you're in a leadership, product development, or technical position.
Summary (in case you don't have access to NYT): Companies are intensifying efforts to acquire vast amounts of training data for AI, often circumventing policies and legal norms, as seen with OpenAI, Google, and @Meta's aggressive strategies, including unauthorized data transcription and potentially exploitative acquisitions. This race underscores the competitive drive to dominate AI technology and highlights significant ethical and sustainability concerns surrounding data harvesting practices.
This raises important questions for business leaders:
✺ How can companies ensure the quality and integrity of data in an era where the demand for extensive AI training datasets is rapidly increasing?
✺ How do we navigate the legal and ethical considerations in a balanced way?
✺ Will markets be created where companies can buy and sell targeted data in a way that respects copyrights and personal rights?
✺ How will we ensure ethical sourcing of the data that trains these industry-changing models?
My take is that:
➊ Leaders must balance the need for more data by deciding to (a) invest in high-quality, reliable data from trusted sources at a higher cost or (b) opt for cheaper, less reliable, more creative, and unproven approaches that may result in degraded models. It may be that certain models and tasks require more accuracy than others.
This choice is crucial, as the quality of data directly impacts the accuracy of AI insights and decision-making; it makes a solid data strategy a strategic imperative. Since data will be the lifeblood of smart, AI-augmented products, Chief Product Officers will need to work closely with CTOs to lead data decisions.
➋ Every company should be thinking about the data it collects, and has the potential to collect, and how that data may provide (a) a defensive advantage for its products and (b) a revenue stream when provided to non-competing companies.
➌ The human cost of quality data. At the macro-level, the leading companies have already ingested all the quality data that is readily available. These companies are now seeking new, innovative ways to generate unique training data. This is raising new technical, business, legal, and ethical questions.
Today, the world of data ranges from the most educated scientific minds studying "synthetic data" to underpaid and overworked teams in less developed corners of the globe, who are "narrating" and "tagging" datasets in China, India, Venezuela, and Kenya.
Karen Hao of The Atlantic is one of the few professionals highlighting this dark underbelly. Her work is important to follow and can be found in MIT Technology Review and The Atlantic.
David Gleason and Peter Memon, curious what you make of the article.