06-08-2023 14:56 - last edited 05-09-2024 09:00 AM
Find and Remove Duplicates From Any Datasource
Need to identify and/or remove records in a datasource based on duplicate values in one or more columns?
This template uses a SharePoint list to demonstrate, but this method ultimately pulls in the data from any source as a standard JSON array and processes it to identify all the duplicates based on the selected columns. From there the duplicates can be easily deleted.
So it will work for SharePoint, Excel, SQL, Dataverse, API GET/DELETE calls to external datasets, and any other dataset used in Power Automate.
Example...
Use any get data action to pull the data into the flow. Use built-in action options like Order By, or later use the Power Automate sort( ) and reverse( ) expressions, to arrange your data in ascending or descending order by your chosen column (see the expression sketch after the sorting notes below).
Sorted by date, Ascending will keep the oldest & delete the newest, while Descending will keep the newest & delete the oldest.
Sorted by number, Ascending will keep the smallest & delete the largest, while Descending will keep the largest & delete the smallest.
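For example, if you sort with expressions instead of the get data action's Order By, a minimal sketch (assuming a hypothetical get data action named Get_items and a sort column named Created) would be:

sort(outputs('Get_items')?['body/value'], 'Created')

and wrapping it in reverse( ) flips it to descending order:

reverse(sort(outputs('Get_items')?['body/value'], 'Created'))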
Move to the Select DuplicateCheckFields action and fill in the column name & dynamic content value for each data source column you want to include in the duplicate check. Only records where all listed columns match will count as duplicates.
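For example, a sketch of the Select Map inputs for checking two SharePoint columns (assuming columns named Title and Column2; your column names will differ):

Title: item()?['Title']
Column2: item()?['Column2']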
The middle actions will process the records, identify the duplicate records, & return a JSON array of only duplicate records.
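Under the hood, each record gets an IsDuplicate flag of 1 or 0 (see the expression shared later in this thread), and a filter keeps only the flagged records. As a rough sketch, in a Filter array action's advanced mode that condition would look like:

@equals(item()?['IsDuplicate'], 1)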
Put whatever action you will use to delete the records inside the loop, and use items('Apply_to_each')?['InsertColumnNameHere'] to reference fields within the apply to each loop. For example, this SharePoint Delete item action only required the ID primary key field for each item to delete, so I used items('Apply_to_each')?['ID'] in the expression editor on the Id input.
Starting Demo SharePoint List
Final Result on The Demo SharePoint List
Rows with IDs 14, 16, & 19 were removed as both the Title & Column 2 fields on those rows matched other rows in the dataset. Notice rows with IDs 15 & 20 were not removed. Although they contain duplicate Title values, the Column 2 values were still unique for those rows.
Thanks for any feedback, & please subscribe to my YouTube channel (https://youtube.com/@tylerkolota?si=uEGKko1U8D29CJ86)
V1.1 added a Reverse( ) expression to the processing actions so that the records kept by an ascending or descending sort are more intuitive given the user inputs.
For anyone having trouble with the legacy import method, here is a post on an alternate import method: Re: Find and Remove Duplicates - Page 3 - Power Platform Community (microsoft.com)
https://www.youtube.com/watch?v=49xj4cN6OFI
Hi @takolota!
Thanks for your excellent solution. I've been using it flawlessly on an SP list for a couple of months to detect when a process has completed. When the process starts, it creates a line in the list. From there the process can take different paths and varying amounts of time. When the second part finishes, it writes a second line in the list.
I use your flow to check when/if I have two lines with the same ID to know if the process has ended.
Now I'm trying to use your solution again for a completely different task:
find and merge duplicate contacts from Dataverse (right now we have around 3,500 contacts). Our contact table is built from several sources/people, so it contains some duplicates (around 450 rows with duplicates/triplicates). However, we want to use a couple of custom fields to check whether a contact is duplicated.
How should I replace the Get Items and the Values in your flow? With a List items?
You'll want to replace the Get items action with a Dataverse List rows action, point the Select From input to the Dataverse value output, and replace the fields to check for duplicates in the Select Map input with your Dataverse fields...
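For example, a sketch of the adjusted Select inputs, assuming hypothetical Dataverse column logical names cr123_email and cr123_lastname:

From: outputs('List_rows')?['body/value']
Map:
Email: item()?['cr123_email']
LastName: item()?['cr123_lastname']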
Hi!
For reasons too long to explain, I needed to mark all the duplicates (not just the later occurrences), so I modified the Select ResetToRecordWithIsDuplicateField action. Instead of:
addProperty(
    outputs('Compose_SortedJSON')[item()]?['RecordJSON'],
    'IsDuplicate',
    if(
        and(
            not(
                equals(
                    0,
                    item()
                )
            ),
            equals(
                outputs('Compose_SortedJSON')[item()]?['DuplicateFieldsJSON'],
                outputs('Compose_SortedJSON')[sub(item(),1)]?['DuplicateFieldsJSON']
            )
        ),
        1,
        0
    )
)
I changed that to:
addProperty(
    outputs('Compose_SortedJSON')[item()]?['RecordJSON'],
    'IsDuplicate',
    if(
        and(
            not(
                equals(
                    0,
                    item()
                )
            ),
            or(
                equals(
                    outputs('Compose_SortedJSON')[item()]?['DuplicateFieldsJSON'],
                    outputs('Compose_SortedJSON')[sub(item(),1)]?['DuplicateFieldsJSON']
                ),
                equals(
                    outputs('Compose_SortedJSON')[item()]?['DuplicateFieldsJSON'],
                    outputs('Compose_SortedJSON')[add(item(),1)]?['DuplicateFieldsJSON']
                )
            )
        ),
        1,
        0
    )
)
So I'm comparing each line not only with the previous one but also with the next one. It works, but I think this is doubling the work, right? Can you think of a better way to mark all the duplicates in your flow?
There should be examples of referencing different columns in the template example. You'll need to copy that reference pattern for whatever column is your primary key (the field with unique values for every row).
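For instance, if your primary key column were named RecordGUID (a hypothetical name), the reference inside the delete loop would be:

items('Apply_to_each')?['RecordGUID']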