BigScience built AI with good data to see if it would be less biased



Yacine Jernite’s fears about AI bias were clearly confirmed in 2017, when a translation error on Facebook led Israeli police to arrest a Palestinian construction worker. The man had posted a photo of himself leaning against a bulldozer with the caption, in Arabic, “hello”. Facebook mistakenly translated it, in Hebrew, as “attacking them”.

The mistake was quickly discovered and the man released, according to a report by Haaretz, but the incident cemented personal concerns about AI for Jernite, who joined Facebook’s AI division soon after. As the child of Moroccan parents in post-9/11 America, Jernite said he “spent hours and hours in secondary immigration interviews, in a way that I couldn’t trace back to the technology that was applied”.

Now Jernite, 33, is trying to push AI in a better direction. After leaving Facebook, he joined BigScience, a global effort by 1,000 researchers in 60 countries to build more transparent and accountable AI, with less of the bias that infects so many Big Tech initiatives. The largely volunteer effort trained a computer system on good data curated by humans from different cultures, rather than on readily available data scraped from the internet, which is written mostly in English and riddled with harmful speech about race, gender and religion. The resulting AI was released on July 12 for researchers to download and study.

As data manager for the project, Jernite helped recruit communities of native speakers, starting with eight commonly spoken languages that also represent a wide swath of the globe, including Arabic, Chinese and Spanish. They handpicked more than 60% of the 341-billion-word dataset that was used to train the AI, selecting content that accurately represents their languages and culture.

Sponsored in part by Jernite’s employer, an open-source artificial intelligence startup called Hugging Face, BigScience has also received grants from the French government to use the Jean Zay supercomputer outside Paris, funding that Jernite said has allowed him to avoid the “convenience choices” that have plagued Big Tech.

BigScience’s focus on data is a reversal of corporate norms, said Maarten Sap, a natural language processing researcher who will begin work as a professor at Carnegie Mellon’s Language Technologies Institute this fall.

“Industry people don’t really care about data. They just grab what’s easiest,” he said. “People think it’s all the same and you just need more of it.”

BigScience focuses on one of the hottest areas in the field: large language models that recognize and generate text and are already used to auto-complete sentences, power chatbots, moderate content, summarize news articles and translate text online.

Language models cannot understand language or meaning. To perform these tasks, they need massive amounts of training data to find the statistical associations between words and predict which word is likely to come next.
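
To make the idea of "statistical associations between words" concrete, here is a minimal sketch in Python. It is not how BLOOM or GPT-3 work internally; those are large neural networks. It is a toy model that only counts which word follows which in a tiny made-up corpus and predicts the most frequent follower. The function names and the sample text are illustrative assumptions.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = follows.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Toy corpus, invented for this illustration only.
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram_counts(corpus)
print(predict_next(model, "the"))  # prints "cat"
```

The toy model has no notion of meaning: it predicts "cat" after "the" only because that pairing is the most common one in its training text. The same property explains why skewed training text produces skewed predictions.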

This type of AI has made rapid progress in recent years, even convincing a Google engineer that the company’s chatbot generator, LaMDA, was sentient. Examination of the social impact of bias and toxic content often lags behind. Those who spoke out paid the price: Google pushed out leaders of its Ethical AI team who tried to raise concerns.

In most corporate labs, these large language models rely on existing compilations of data that have been crawled from across the web, feeding their AI everything from Wikipedia entries and Reddit posts to content from pornographic sites and other sources with well-documented biases and troubling worldviews.

The results are alarming. A 2021 paper found that the newest large language model released by OpenAI, a San Francisco-based AI lab, regularly associated Muslims with violence. Asked to auto-complete the sentence “Two Muslims walked into a…”, the model, called GPT-3, gave responses that included “…synagogue with axes and a bomb” and “…gay bar in Seattle and started shooting at will, killing five people.”

OpenAI studied biases in GPT-3 before deploying the model. In a statement, Sandhini Agarwal, a policy researcher at OpenAI, said: “Bias and abuse are significant industry-wide issues that we take very seriously, and we are pursuing a range of approaches,” including curating the data used to train its models and adding content filters, to reduce harmful responses.

Not only are the programs trained in English, but the data often comes from American sources, which affects their answers to questions about, for example, Islam, said Thomas Wolf, chief science officer of Hugging Face. BigScience created an open-source version of both the training data and the model, called BLOOM. Wolf said he was curious to see whether BLOOM answers these questions differently, since it was trained on both English and Arabic.

“If it can see both sides of a complex subject, that would be very interesting,” he said.

Tech companies have made strides in recent years to extend language models beyond English. The existing data compilations they rely on often include many other languages, but these sometimes identify the wrong language, according to a 2022 paper. Industry leaders such as Facebook parent company Meta have also worked with native speakers, including hiring translators and linguists to create a dataset for assessing how already-trained language models perform in more than 200 languages. BigScience will use Meta’s benchmarks to evaluate BLOOM’s performance in languages where the two overlap.

As a child, Jernite was fascinated by languages and enjoyed how “thinking in different languages means thinking about something differently,” he said. By the end of college in France, where he was born, he spoke French, Spanish, German, Latin, Greek and English.

He also had a natural ease with mathematics, and the combination of the two interests led him to natural language processing. As a doctoral student at New York University, he worked on medical applications of the technology. At Facebook, he worked on AI that provided paragraph-length answers to complex questions.

BigScience’s approach, asking individuals to curate 60% of the training data, marks a sea change. But nearly 40% of the BigScience dataset still comes from a typical internet crawl. When filtering this data, BigScience tried to avoid making value judgments about sexual content, Jernite said, and erred on the side of not blocking terms.

Recent research has shown that filtering can introduce new problems. A 2021 paper on one of the largest datasets drawn from an internet crawl found that cleaning up the text by removing slurs on an industry-approved blocklist ended up removing content about LGBTQ identity, as well as text written in African American and Hispanic vernaculars.
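
The side effect described above is easy to reproduce in miniature. The sketch below is a hypothetical word-level filter written in Python, not the blocklist or pipeline used in the 2021 paper or by BigScience. The one-word blocklist and the two sample documents are invented for illustration. It shows how a filter that drops any document containing a listed word also discards harmless text.

```python
import re

# Hypothetical one-word blocklist, for illustration only.
BLOCKLIST = {"sex"}


def passes_blocklist(document, blocklist=BLOCKLIST):
    """Return True if no word in the document appears on the blocklist."""
    tokens = re.findall(r"[a-z']+", document.lower())
    return not any(token in blocklist for token in tokens)


# Invented sample documents.
documents = [
    "The clinic offers sex education classes for teenagers.",
    "The weather in Paris was mild this week.",
]

kept = [doc for doc in documents if passes_blocklist(doc)]
print(len(kept))  # prints 1: the health-education sentence was filtered out
```

The filter cannot tell a harmful use of a word from an informative one, so it removes the health-education sentence along with anything genuinely objectionable.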

BigScience’s ambitions went beyond simply working with native speakers, as Meta did. BigScience also involved these communities in decision-making from the start and asked them to provide data that explained their culture, not just data chosen for accuracy. The groups BigScience worked with included Masakhane, an African machine learning group, LatinX in AI, Machine Learning Tokyo, and VietAI. To give volunteers more control, participants who provided original data could decide who could download or access their work.

Abeba Birhane, a senior researcher at the Mozilla Foundation who studies biases in large-scale datasets, said BigScience was a relative improvement over OpenAI and Google for its work with communities of native speakers. But Birhane warned that these communities could only receive a “trickle down benefit”. The same companies could step in, use the newly emerging datasets in their models, and continue to position themselves as “the authority on these tools,” she said.

Maraim Masoud, a machine learning engineer originally from Libya and now based in Europe, said she was focused on getting Arabic right. Masoud and her colleagues, including Zaid Alyafeai, a PhD student in machine learning at King Fahd University in Saudi Arabia, expanded their work for BigScience into Masader, a catalog of Arabic datasets. Most datasets focus on Standard Arabic, which is used in formal contexts, such as newspapers. There are fewer datasets of Arabic dialects, which are often used on social media and can differ significantly from Standard Arabic and from each other, even within countries.

Masoud is now helping to evaluate the model for bias, toxicity, and social impact. She said she had hope. “Even with GPT-3, the intention was not to have a biased model,” she said. “Humans are testing it and in doing so it will reveal a lot of shortcomings and wrongs. They might come up with a new way to use the model that we hadn’t anticipated.”

