AI-UniBot: Identifying and Removing Duplicate Files

Quick access to information and cheap storage are only one side of digital progress. Our Developers focused on the other: storage cluttered with unnecessary duplicates. Keeping multiple copies of a document seems cheaper than manually finding and deleting each one, because recognizing and removing duplicates takes resources that nobody wants to spend on it. At least a few hours of Employees' working time would have to be set aside regularly for such "cleaning", so most Organizations simply ignore the issue. As a result, almost every one of us has several versions of the same document. Now we have entrusted the "big cleaning" to UniBot Personal Assistant & Corporate Chatbot.

Ironically, using Artificial Intelligence (AI) to search for files or answer inquiries only makes the situation worse, because duplicate documents become a serious obstacle to the system's efficiency. They not only increase the time and financial cost of indexing but also reduce the accuracy and relevance of the AI's answers, since the system has to process several identical fragments.
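The effect of identical fragments on answers can be illustrated with a small sketch. It assumes, purely for illustration, that an assistant passes only the top k retrieved fragments to the model; the function name top_k and the sample data are hypothetical and do not describe UniBot's internals.

```python
def top_k(ranked_fragments: list[tuple[str, str]], k: int, dedupe: bool) -> list[tuple[str, str]]:
    """Pick up to k fragments from (source_name, text) pairs ranked best first.

    With dedupe=True, a fragment whose text was already selected is skipped,
    so every selected slot carries different information.
    """
    selected: list[tuple[str, str]] = []
    seen_texts: set[str] = set()
    for source, text in ranked_fragments:
        if dedupe and text in seen_texts:
            continue
        seen_texts.add(text)
        selected.append((source, text))
        if len(selected) == k:
            break
    return selected


# Hypothetical ranked search results: three copies of one fragment outrank the rest.
ranked = [
    ("leave_policy.docx", "Vacation requests need manager approval."),
    ("leave_policy_copy.docx", "Vacation requests need manager approval."),
    ("leave_policy_old.docx", "Vacation requests need manager approval."),
    ("hr_portal.docx", "Requests are filed in the HR portal."),
]

with_duplicates = top_k(ranked, k=3, dedupe=False)
without_duplicates = top_k(ranked, k=3, dedupe=True)
```

Without deduplication, all three slots are taken by the same sentence and the HR portal fragment never reaches the model; with deduplication, both distinct fragments are selected.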

Therefore, we taught UniBot to recognize duplicates and remove them from the semantic index. Our strategy for identifying copies is based on a deep analysis that compares not only document names but also the content itself. This allows UniBot to detect duplicates accurately even when the modification date changes without any actual change to the document, for example after it is opened in Office Online. This optimizes information processing and increases the efficiency of the system as a whole, so Users get more relevant and accurate answers.
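One common way to compare content rather than names or dates is to hash each file's bytes. The sketch below is a minimal illustration of that general technique, not UniBot's actual method; the function names content_digest and find_duplicates are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def content_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root whose contents are byte-for-byte identical.

    File names and modification times are never consulted, so a copy with a
    new name or a fresh timestamp still lands in the same group.
    """
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[content_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling find_duplicates(Path("some_folder")) returns one list per set of identical files. A byte-level hash only catches exact copies; files that were re-saved in another format need a comparison of the extracted text instead.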

Consider a large Corporation that wants to give its Employees access to current policies, procedures, and standards. Over the years, data migrations, erroneous saves, and file duplication between Departments had produced a significant number of document copies, which became a serious obstacle to working with files effectively. When the Corporation began using the new function in UniBot, it turned out that each document had 1.3 duplicates on average. After detecting these copies, UniBot automatically deleted them. As a result, the Organization cut its costs significantly: by more than a factor of two for file indexing and by a factor of 4-5 for the AI analysis of information when preparing answers for Employees. In addition, cleaning the semantic index improved data quality and reduced infrastructure costs by lowering the load on Azure resources. The Corporation thus reduced costs and increased the efficiency of working with documents at the same time, while giving Employees quick access to the most relevant data.
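The indexing figure is consistent with simple arithmetic on the reported average. The sketch below assumes, as a simplification, that indexing cost is proportional to the number of files indexed; the function names are hypothetical, and the 4-5 times saving on AI analysis is not derived here.

```python
def reduction_factor(avg_duplicates: float) -> float:
    """How many times fewer files are indexed once duplicates are removed.

    Each unique document exists as 1 original plus avg_duplicates copies.
    """
    return 1.0 + avg_duplicates


def removed_share(avg_duplicates: float) -> float:
    """Fraction of all stored files that are duplicates."""
    return avg_duplicates / (1.0 + avg_duplicates)


factor = reduction_factor(1.3)  # about 2.3 copies per unique document
share = removed_share(1.3)      # about 0.57 of the files are duplicates
print(f"indexed files drop {factor:.1f}x; {share:.0%} of files were duplicates")
```

With 1.3 duplicates per document, roughly 2.3 files were indexed for every unique document, which matches a saving of more than a factor of two on indexing.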

Another example is a Research Company that actively collects data from the Internet on its own and also requests it from various standardization Organizations. In such Companies, files often have different names and formats while the content stays the same, so the number of duplicates can reach 4-5 per document. Before the copy recognition function was implemented in UniBot, the system could not always find the necessary information even with the DeepSearch tool, because the number of search attempts is limited. The ability to recognize duplicates has changed the situation significantly. Now the Company can not only optimize the cost of searching and processing data but also markedly increase the speed and quality of query processing. As a result, the efficiency of the Researchers' work has increased significantly.
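When the same content arrives under different names and in different formats, a byte-level comparison is not enough. A minimal sketch of one possible approach is to fingerprint the text after normalizing it. It assumes the text has already been extracted from each file (extraction is out of scope here), and the names text_fingerprint and group_by_content are hypothetical, not UniBot's API.

```python
import hashlib
import re
import unicodedata


def text_fingerprint(text: str) -> str:
    """Hash text after normalizing Unicode form, letter case, and whitespace."""
    normalized = unicodedata.normalize("NFKC", text).casefold()
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_by_content(extracted: dict[str, str]) -> list[list[str]]:
    """Group file names whose extracted text has the same fingerprint."""
    groups: dict[str, list[str]] = {}
    for name, text in extracted.items():
        groups.setdefault(text_fingerprint(text), []).append(name)
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical extracted text from three files with unrelated names and formats.
extracted = {
    "iso_9001.pdf": "Quality  Management\nSystems",
    "standard-final.docx": "quality management systems ",
    "notes.txt": "Something else entirely",
}
duplicate_groups = group_by_content(extracted)
```

Here the PDF and the DOCX fall into one group despite their names, line breaks, and capitalization. Exact fingerprints still miss copies with small wording changes; catching those would need a similarity measure rather than an equality check.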

In short, identifying duplicates and removing them from the semantic index in UniBot Personal Assistant & Corporate Chatbot significantly increases the efficiency of working with documents. Users no longer have to wade through copies and quickly receive the most relevant answers, while Customers reduce the cost of processing unnecessary data. UniBot thus creates a more efficient information environment for Organizations that value accuracy and speed in data processing.