06-08-2023 14:56 PM - last edited 05-09-2024 09:00 AM
Find and Remove Duplicates From Any Datasource
Need to identify and/or remove records in a datasource based on duplicate values in one or more columns?
This template uses a SharePoint list to demonstrate, but this method ultimately pulls in the data from any source as a standard JSON array and processes it to identify all the duplicates based on the selected columns. From there the duplicates can be easily deleted.
So it will work for SharePoint, Excel, SQL, Dataverse, API GET/DELETE calls to external datasets, and any other dataset used in Power Automate.
Example...
Use any get data action to pull the data into the flow. Use built in action functions like Order By or later use a Power Automate Sort( ) expression and Power Automate Reverse expression to format your data in ascending or descending order by your chosen column.
Sorted by date, Ascending will keep the oldest & delete the newest, while Descending will keep the newest & delete the oldest.
Sorted by number, Ascending will keep the smallest & delete the largest, while Descending will keep the largest & delete the smallest.
Move to the Select DuplicateCheckFields action and fill the column name & dynamic content value for each data source column you want to include in the check for duplicates. Only records where all listed columns match will count as duplicates.
The middle actions will process the records, identify the duplicate records, & return a JSON array of only duplicate records.
Put whatever action you will use to delete the records inside the loop and use items('Apply_to_each')?['InsertColumnNameHere'] to reference fields within the apply to each loop. For example, this SharePoint Delete item action just required the ID primary key field for each item to delete, so I used items('Apply_to_each')?['ID'] in the expression editor on the Id input.
Starting Demo SharePoint List
Final Result on The Demo SharePoint List
Rows with IDs 14, 16, & 19 were removed as both the Title & Column 2 fields on those rows matched other rows in the dataset. Notice rows with IDs 15 & 20 were not removed. Although they contain duplicate Title values, the Column 2 values were still unique for those rows.
Thanks for any feedback, & please subscribe to my YouTube channel (https://youtube.com/@tylerkolota?si=uEGKko1U8D29CJ86)
V1.1 added a Reverse( ) expression to the processing actions to make the records kept by ascending or descending sort more intuitive for the user-inputs.
For anyone having trouble with the legacy import method, here is a post on an alternate import method: Re: Find and Remove Duplicates - Page 3 - Power Platform Community (microsoft.com)
watch?v=49xj4cN6OFI
Check the alternate version of this flow for getting distinct records: https://powerusers.microsoft.com/t5/Power-Automate-Cookbook/Get-Distinct-Records/td-p/2191533
Great workaround. Microsoft should still create an array expression to remove duplicates after these years. Unacceptable.
Hey, I'm new at Power Automate and even coding in general. When you're adding the expressions in each step, what exactly are you doing? Could you clarify with more in-between steps?
Thanks
-R
The 1st Select is used as an input action. It iterates over each record & results in a JSON object with a key & value for each column we want to use to check for duplicates. It also has a key & value pair where the value is the entire JSON object for the record. And a guid property is added to make sure each record is unique because if each wasn’t unique & there were two or more records that exactly matched in every way, then it would mess up a later step.
The next Select reformats each item into a JSON object with only 2 key & value pairs. One value is the entire record, the other is an object / record with only the columns/fields we want to check for duplicates.
The last Select is where most things come together.
For each of the items reformated into those 2 key value pairs, It takes the part that is a copy of the entire original record & adds a property (another key value pair) that marks each record as a Duplicate 1 or Not a Duplicate 0.
The way it calculates this is the key part of the flow. For each item from the previous Select, it converts the entire list / array of items from the Select into a string. From there it splits the string of all items on the stringified instance of the current item (which is definitely unique bc of the guid so the array returned only has 2 pieces) & skips 1 of those pieces in that array (everything up to the current item in the JSON array of all items) which means it now has a string with only the items later in the array than the current item. So if it then splits on the current item’s stringified Duplicate Columns the array returned on that split will have a length of 1 if no other Duplicate Columns object matched or a length greater than 1 if a match was found in all remaining items after the current item in the JSON array.
This is a rather advanced & compact use of multiple expressions. The goal was to make something really efficient that anyone could pop into their flows, removing duplicates in an instant. But it’s not the easiest to understand the actual details of how it does what it does.
Damien just did a video on an alternate version that still removes duplicates without loops if you want to check that: https://youtu.be/SGXAqAzYUSM
Here is a 1 action trick to remove duplicates... Valid until Microsoft wakes up and decide to read these forums and build a specific array action or expression to fill this huge void after all these years. Power Automate is a dormant product, the team must have 3 people at most.
Remove Duplicates from a Power Automate Array with One Action - FlowJoe.io
@asdfaf4
That is the standard method of removing duplicate values if you only need distinct values from one column. The point of this & similar flows is likely to remove duplicates from an existing data-set, in which case you need to somehow match each duplicate value with their record ids. This was usually done with an additional Apply to each & Filter array action, but this allows people to do this without any slower Apply to each loop.
This method also enables the easy identifying of duplicates based on multiple columns, something that takes even more actions & expressions with the Union( ) method.
I have also seen a few more niche use-cases where people just want to efficiently find all the duplicate records & then do something else with them. Some of these require other values in the rest of the record(s) as well.