06-08-2023 14:56 PM - last edited 05-09-2024 09:00 AM
Find and Remove Duplicates From Any Datasource
Need to identify and/or remove records in a datasource based on duplicate values in one or more columns?
This template uses a SharePoint list to demonstrate, but this method ultimately pulls in the data from any source as a standard JSON array and processes it to identify all the duplicates based on the selected columns. From there the duplicates can be easily deleted.
So it will work for SharePoint, Excel, SQL, Dataverse, API GET/DELETE calls to external datasets, and any other dataset used in Power Automate.
Example...
Use any get data action to pull the data into the flow. Use built in action functions like Order By or later use a Power Automate Sort( ) expression and Power Automate Reverse expression to format your data in ascending or descending order by your chosen column.
Sorted by date, Ascending will keep the oldest & delete the newest, while Descending will keep the newest & delete the oldest.
Sorted by number, Ascending will keep the smallest & delete the largest, while Descending will keep the largest & delete the smallest.
Move to the Select DuplicateCheckFields action and fill the column name & dynamic content value for each data source column you want to include in the check for duplicates. Only records where all listed columns match will count as duplicates.
The middle actions will process the records, identify the duplicate records, & return a JSON array of only duplicate records.
Put whatever action you will use to delete the records inside the loop and use items('Apply_to_each')?['InsertColumnNameHere'] to reference fields within the apply to each loop. For example, this SharePoint Delete item action just required the ID primary key field for each item to delete, so I used items('Apply_to_each')?['ID'] in the expression editor on the Id input.
Starting Demo SharePoint List
Final Result on The Demo SharePoint List
Rows with IDs 14, 16, & 19 were removed as both the Title & Column 2 fields on those rows matched other rows in the dataset. Notice rows with IDs 15 & 20 were not removed. Although they contain duplicate Title values, the Column 2 values were still unique for those rows.
Thanks for any feedback, & please subscribe to my YouTube channel (https://youtube.com/@tylerkolota?si=uEGKko1U8D29CJ86)
V1.1 added a Reverse( ) expression to the processing actions to make the records kept by ascending or descending sort more intuitive for the user-inputs.
For anyone having trouble with the legacy import method, here is a post on an alternate import method: Re: Find and Remove Duplicates - Page 3 - Power Platform Community (microsoft.com)
watch?v=49xj4cN6OFI
OK, I'm passed that... Now I'm getting: "Flow save failed with code 'InvalidPaginationPolicy' and message 'The pagination policy of workflow run action 'Get_items' of type 'OpenApiConnection' at line '1' and column '4200' is not valid. The value specified for property 'minimumItemsCount' exceeds the maximum allowed. Actual: '100000'. Maximum: '5000'.'."
@pfoltz If you need to you can check the parts of this template that read in a SP list in 5000 item batches inside a Do until loop before re-combining them into a single JSON array: https://powerusers.microsoft.com/t5/Power-Automate-Cookbook/Batch-Update-Create-and-Upsert-SharePoin...
Hi Could you please help me?
I tried to import the zip file to power apps, however keep having error.
@Joshualumens That's a new issue that has been appearing for solutions with connections originating from outside the solution. Here try solution version 7 where those should be removed.
I had an issue with my flow, getting an error on step Get Items : "The attempted operation is prohibited because it exceeds the list view threshold."
Even if I changed my list view etc... still had the issue.
The problem is that the step Get items, can't sort if there is more than 5,000 rows. So you need to remove that part if you want to make it run. Please note, than mean you don't know if it will delete the old or new value. In my case it didn't really matter, but for others it might.
There is a piece of this template you could copy that pulls the SharePoint list items in 5000 item increments before combining them into a single JSON array.
That would get around the list view threshold of 5000.
The reference for the file identifier is {Identifier} like...
items('For_each')?['{Identifier}']