Thousands of passwords are used to train ChatGPT and other AI models

Nearly 12,000 pieces of sensitive information, such as API keys and passwords, were discovered within Common Crawl by researchers from Truffle Security. Common Crawl is a vast, freely available corpus of web data. Collected since 2008 from across the web, this data is notably used to train AI models. Giants such as OpenAI, DeepSeek, Google, Meta, Anthropic and Stability use the dataset to train their large language models (LLMs). It is partly thanks to this data that AIs like ChatGPT evolve and learn to respond to their users’ queries.

Nearly 12,000 pieces of confidential information provided to AI

According to the researchers, who combed through 400 terabytes of data from 2.67 billion web pages, the repository includes 11,908 pieces of confidential information. The scan relied on TruffleHog, an open source security tool designed to search for sensitive information such as API keys, passwords, or other secrets.

These secrets ended up in the hands of AI models during training. This discovery “highlights a growing problem: LLMs trained on insecure code can unintentionally produce risky results”. In short, AIs could leak information in one way or another, and produce responses that include sensitive data. It should be remembered, however, that the data used to train major language models is generally preprocessed before training. This preprocessing cleans the data by removing duplicates and harmful or useless information.

Among the data found in the corpus, there are valid API keys providing access to services such as Amazon Web Services (AWS) or MailChimp. The researchers mainly found a wealth of keys for MailChimp, the email automation platform.
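To give an idea of how this kind of key can be spotted in page content, here is a minimal sketch in Python. It is not TruffleHog's implementation: the two regular expressions are assumptions based on the commonly seen formats of AWS access key IDs and MailChimp API keys, the function name `find_secrets` is hypothetical, and the sample page uses placeholder keys (the AWS value is the example key from Amazon's own documentation).

```python
import re
from collections import Counter

# Assumed patterns based on commonly seen key formats. A real scanner uses
# many more detectors and checks whether each candidate is actually valid.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}


def find_secrets(text):
    """Return a Counter mapping (kind, value) to the number of occurrences."""
    hits = Counter()
    for kind, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            hits[(kind, match)] += 1
    return hits


# Made-up page content with placeholder keys, for illustration only.
SAMPLE_PAGE = """
<script>
  var awsKey = "AKIAIOSFODNN7EXAMPLE";
  var mcKey = "0123456789abcdef0123456789abcdef-us12";
</script>
<form data-key="0123456789abcdef0123456789abcdef-us12"></form>
"""

if __name__ == "__main__":
    for (kind, value), count in find_secrets(SAMPLE_PAGE).items():
        # Print only a prefix so the full value never lands in logs.
        print(kind, value[:8] + "...", count)
```

Counting occurrences, rather than only listing matches, makes it easy to see when the same key shows up more than once on a page.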

The developer error

As Truffle Security explains in its report, developers made the mistake of hardcoding sensitive data (such as credentials or API keys) directly into HTML forms and JavaScript code. Some keys even appeared multiple times, which increased the risk.
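A common way to avoid this mistake is to keep the key out of the page source entirely and read it on the server at runtime. The snippet below is only a sketch of that idea, not a recommendation from the report: the function `load_api_key` and the environment variable name `MAILCHIMP_API_KEY` are hypothetical.

```python
import os


def load_api_key(env_var="MAILCHIMP_API_KEY"):
    """Read the key from the server's environment instead of the source code.

    The variable name is a hypothetical example; use whatever name your
    deployment defines.
    """
    key = os.environ.get(env_var)
    if not key:
        # Fail loudly rather than falling back to a hardcoded value.
        raise RuntimeError(f"{env_var} is not set")
    return key
```

Because the key lives only in the server's environment, it never appears in the HTML or JavaScript sent to visitors, and so cannot be picked up by a web crawler.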

Following its discovery, Truffle Security contacted the organizations whose keys and passwords had ended up in the dataset. With the help of the researchers, the companies were able to “collectively rotate/revoke several thousand keys” as a security measure.

Source: Truffle Security
