GUWAHATI, March 6: Researchers at the Indian Institute of Technology (IIT) Guwahati have introduced a groundbreaking multilingual and scalable technique aimed at identifying and rectifying Surface Name Errors (SNEs) in Wikipedia. This innovation is set to bolster the reliability of information for both human readers and artificial intelligence (AI) systems.
A surface name is the term used in Wikipedia articles to reference or link to another entity, and an SNE arises when this term is incorrect.
The IIT Guwahati research team conducted a study revealing that approximately three to six percent of all entity mentions in Wikipedia contain SNEs. Although these errors may seem trivial, they can have profound consequences.
For users, an erroneous surface name can diminish the perceived trustworthiness and reliability of the information presented.
Moreover, numerous machine learning and deep learning models rely on Wikipedia as a primary dataset. Errors in surface names can adversely affect AI tasks and the performance of these models, according to the research team.
To tackle this issue, Prof Amit Awekar, an Associate Professor in the Department of Computer Science and Engineering at IIT Guwahati, along with MTech student Anuj Khare (2022 batch), developed a method that relies on the frequency patterns of surface names, which makes it applicable across languages. Their approach detects and classifies SNEs in three steps.
The initial step involved scanning Wikipedia and transforming each link into a quadruplet that includes details about the page where the link is found, the page it directs to, the surface name used, and the surrounding textual context.
In the subsequent step, the method evaluated the surface name, deeming it correct only if it appeared at least ten times and constituted at least five percent of all links leading to a specific page.
Surface names failing to meet these criteria were flagged as potential errors.
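The first two steps can be illustrated with a short Python sketch. It is not the researchers' code: the Link record, the function name and the decision to count occurrences separately for each target page are assumptions made for illustration, while the thresholds of ten occurrences and five percent come from the description above.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Link(NamedTuple):
    """One Wikipedia link as a quadruplet (field names are hypothetical)."""
    source_page: str   # page where the link is found
    target_page: str   # page the link directs to
    surface_name: str  # text used for the link
    context: str       # surrounding text


MIN_COUNT = 10    # at least ten occurrences
MIN_SHARE = 0.05  # at least five percent of all links to the target page


def flag_suspect_surface_names(links, min_count=MIN_COUNT, min_share=MIN_SHARE):
    """Return {target_page: set of surface names that fail either threshold}."""
    # Count how often each surface name is used for each target page.
    counts = defaultdict(Counter)
    for link in links:
        counts[link.target_page][link.surface_name] += 1

    suspects = {}
    for target, names in counts.items():
        total = sum(names.values())
        flagged = {
            name
            for name, n in names.items()
            if n < min_count or n / total < min_share
        }
        if flagged:
            suspects[target] = flagged
    return suspects
```

Because the check uses only counts and proportions, nothing in the sketch depends on the language of the text being scanned.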
The final step involved categorizing the identified errors into ‘typing mistakes’, such as ‘Gawahati’ instead of ‘Guwahati’, or ‘entity span errors’, where incorrect or additional words are mistakenly included in the link.
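The report does not say how the two categories are told apart. The following Python sketch shows one plausible heuristic, offered purely as an assumption: a flagged name that is within a small edit distance of an accepted name is treated as a typing mistake, and one that adds or drops whole words relative to an accepted name is treated as an entity span error. The function names and the distance limit of two are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = cur
    return prev[-1]


def classify_error(suspect, accepted_names, max_typo_distance=2):
    """Label a flagged surface name; a hypothetical heuristic, not the paper's."""
    suspect_l = suspect.lower()
    suspect_words = set(suspect_l.split())

    # Typing mistake: same number of words, only a few characters differ.
    for name in accepted_names:
        name_l = name.lower()
        same_word_count = len(suspect_l.split()) == len(name_l.split())
        if same_word_count and 0 < edit_distance(suspect_l, name_l) <= max_typo_distance:
            return "typing mistake"

    # Entity span error: the link text has extra or missing whole words.
    for name in accepted_names:
        name_words = set(name.lower().split())
        if name_words < suspect_words or suspect_words < name_words:
            return "entity span error"

    return "unclassified"
```

Under this heuristic, ‘Gawahati’ checked against the accepted name ‘Guwahati’ differs by a single character and would be labelled a typing mistake.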
The researchers tested their method on eight languages: English, Sanskrit, German, Italian, Urdu, Hindi, Marathi, and Gujarati. They reported accurate results across all of them.
Discussing the practical implications of their method, Prof Awekar stated, “This work emphasizes the importance of not blindly trusting data from the web, both for human users and for training AI models. Quality data is fundamental to any effective AI model and its subsequent applications.”
To validate their method, the research team compared snapshots of English Wikipedia from 2018 and 2022. They found that around 30 percent of the errors their method predicted in the earlier snapshot had been rectified on Wikipedia over those four years, which the team cites as evidence of its effectiveness.
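A comparison of this kind can be sketched in Python as follows. The representation of a link as a (source page, target page, surface name) tuple and the function name are assumptions for illustration, and the sketch treats any predicted error that no longer appears in the later snapshot as rectified, which is a simplification.

```python
def fraction_rectified(predicted_errors, later_links):
    """Share of predicted errors that are absent from the later snapshot.

    Both arguments are iterables of (source_page, target_page, surface_name)
    tuples; this layout is hypothetical.
    """
    predicted = set(predicted_errors)
    if not predicted:
        return 0.0
    # Errors that still appear unchanged in the later snapshot.
    still_present = predicted & set(later_links)
    return (len(predicted) - len(still_present)) / len(predicted)
```

A link can also disappear because a page was deleted or rewritten, so a careful analysis would need to separate genuine corrections from such cases.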
Wikipedia is curated by volunteers globally, and this method can assist editors in uncovering hidden typos and linking errors that might otherwise go unnoticed for extended periods, according to Prof Awekar. The Wikipedia community has accepted over 99 percent of the manual corrections proposed by the researchers.