
REGEX Capture Groups via Script - Javascript or Python

Hi Everyone,

 

I need to run a REGEX through a large text file (very badly formatted CSV with oodles of issues in parsing) and return the results to PAD. The only way I can see to achieve this is via either JS or Python script. I am currently working with a JS implementation using the following:

 

Set Variable called %REGEX% = ^(?<SKU>[\\d\\w]+),\\d*,\"?(?<Description>.+)\"?,\\$(?<Price>[\\d\\.]+)

Read Text from File into %FileContents%

Run JavaScript:

var csv = %FileContents%;
var reg = %RegEx%;
var out = csv.matchAll(reg);
WScript.Echo(out);

 

This does not work 😞

 

I receive the error (truncated) --> Microsoft JScript compilation error: Expected ';'

 

Can anyone point me in the right direction to solving this? I'm guessing that it is something syntactic with the REGEX itself but I am not sure how to debug this particular scenario.

 

Any help greatly appreciated.

 

Cheers

 

MisterH

 

1 ACCEPTED SOLUTION

Accepted Solutions

Hi Everyone,

 

I have found a way to get this really dirty data from a CSV into PAD in a clean way that takes only seconds, using the scripting engine capabilities.

 

The way to achieve a result is as follows:

  1. Select the filename (with its path) that you want to process
  2. Set a variable with the REGEX pattern you want to use to extract the data from the file
  3. Using the Python Script step run the REGEX on the file and return the result
  4. Get a temporary filename
  5. Export / Save the extracted data, as text, to the temporary file
  6. Load the data from the temporary file with the CSV step to get a 'clean' read of the information
  7. Delete the temporary file since it's no longer needed

In my case the specific Python script I am using is the following (this is some really dirty data with loads of special characters in it, extra commas and quote marks, you name it - the REGEX pattern does the work of grabbing the correct chunks of text from each line):

import re  # Import the regex engine

r = r'''%RegEx%'''    # PAD variable RegEx (raw string, so backslashes survive)
f = r'''%CSVFile%'''  # PAD variable CSVFile (raw string, so backslashes in a Windows path are not read as escapes)

# The regex pattern must be compiled. Regex being used is      ^([\d\w-]+),[\d\w /]*,"?(.+[^"])(?:"+)?,\$([\d\.]+),.*$
p = re.compile(r, re.MULTILINE)  # MULTILINE so that ^ and $ match at every line

with open(f, 'r') as file:  # Open the CSV file for reading
    txt = file.read()       # Read the entire contents into 'txt'

m = p.findall(txt)  # One (SKU, description, price) tuple per matching line

rows = ['"SKU"|||"DES"|||"BUY"']  # Header row
for tup in m:
    SKU = tup[0]  # Get the SKU
    DES = re.sub(r'''['"+&\-.()*/,#=]''', ' ', tup[1])  # Replace special characters with spaces
    DES = re.sub(' +', ' ', DES)  # Collapse the runs of spaces left by the clean-up
    BUY = tup[2]  # Get the price
    rows.append('"' + SKU + '"|||"' + DES + '"|||' + BUY)  # Add a new line of data

OUT = '\n'.join(rows)  # Join the lines; no trailing newline after the last one
print(OUT)             # Return the result to PAD
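
For anyone who wants to try steps 4 to 7 outside PAD, here is a minimal sketch of the same round trip in plain Python. It is only an illustration: in the flow itself those steps are PAD actions, and the sample text, the `records` structure and the use of `tempfile.mkstemp` are assumptions made for this example.

```python
import os
import tempfile

SEP = '|||'  # Same three-character separator the script above writes

# Hypothetical script output, in the same shape as OUT above.
out = ('"SKU"|||"DES"|||"BUY"\n'
       '"CS4836"|||"CS iKEY 1 DOOR KIT 4836 with 10 iKEYS"|||123.00')

# Steps 4 and 5: get a temporary filename and save the text to it.
fd, tmp_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as handle:
    handle.write(out)

# Step 6: read the file back and split every line on the separator.
with open(tmp_path, 'r') as handle:
    lines = handle.read().split('\n')
header = [name.strip('"') for name in lines[0].split(SEP)]
records = [dict(zip(header, [value.strip('"') for value in line.split(SEP)]))
           for line in lines[1:]]

# Step 7: delete the temporary file since it is no longer needed.
os.remove(tmp_path)

print(records)
```

A multi-character separator such as `|||` is unlikely to appear in the data itself, which keeps the split in step 6 unambiguous.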

 

The entire PAD process looks like this:

 

20220713_111812_PAD.Designer_v5OLFRDh08.png

 

I hope that this helps anyone trying to deal with large quantities of poorly formatted CSV or TXT data. The process is relatively quick too - about 40 seconds for 16k+ lines.

 

Cheers

 

MisterH


17 REPLIES

PS - The REGEX pattern works perfectly for its intended task (created and tested in REGEX Buddy)

In JavaScript, every statement should end with a semicolon (;)

So correct that and if still facing issues then post the original script here to check for errors.

VJR
Multi Super User

After re-reading, I think I understand your code above better.

Are the first three lines your PAD actions and the rest your JavaScript code?

Also check out "Parse Text" which has an option to turn on regex and generate the output.

Hi VJR,

 

Yes, first three lines are the steps in PAD.

1/ Set a variable to hold the REGEX pattern

2/ Read the text 'raw' from the file

3/ Run the JavaScript step below

 

The remaining lines are the JS itself, with %FileContents% and %RegEx% as the PAD variables holding their respective data, and being passed from PAD to the JS step.

var csv = %FileContents%;
var reg = %RegEx%;
var out = csv.matchAll(reg);
WScript.Echo(out);

 

 Sorry for the lack of clarity previously. I hope this explains things a little better.

 

MisterH

No issues. Have you checked out the Parse Text action I mentioned above? It has a built-in option where you can pass the regex and get the output.

PS (again) - the Parse Text step is going to be very inefficient for processing the data. Each one of these files has more than 16k lines, and there is going to be a continuous stream of them, all poorly formatted due to the process that produces them - beyond my control I'm afraid. The number of data points I need to extract per line is 3 (as you can see from the capture groups in the REGEX); handling this in a single step would be a vast improvement, and in theory should be possible with either JS or Python from what I can tell. I'm just not sure how to correctly pass in the data for it to work.

I read this message in your original post but could not correlate it with the code written in JS.

I couldn't correlate it the second time either. It's not about your explanation but about me :).

 

I think the below is the equivalent of what you are trying to achieve via the JavaScript code.

When you disable "First occurrence only", the matches variable will return multiple matches.

Post back if it is much more complicated.

 

VJR_0-1657003634781.png

 

Here is a sample data set with some of the 'cleaner' data that is available in one of the files:

 

ACEFOBKIT,,"CS ACE WIEGAND 1 DOOR KIT, INC CS4890 CONTROLLER, HID READER & 10 X SEOS FOBS",$123.00,1 Door Kits,CST,
ACEFOBKITSTART,458922514,"CS ACE WIEGAND 1 DOOR KIT, INC CS4890 CONTROLLER, PROG KEYPAD HID READER & 10 X SEOS FOBS",$123.00,1 Door Kits,CST,
CS4836,,CS iKEY 1 DOOR KIT 4836 with 10 iKEYS,$123.00,1 Door Kits,CST,
CS4828,,CS iKEY 2 DOOR KIT 4828 with 10 iKEYS,$123.00,2 & 4 Door Kits,CST,

 

and yes, there is a comma(,) at the end of every line.

As a side note to my above screenshot on Parse Text: to access a PAD variable in a script, enclose it in quotes

 

var csv = "%FileContents%";

Parse Text can do multi-match, but it doesn't seem to handle multiple capture groups; it simply returns the entire string between where the match starts and finishes. I need to specifically extract the capture groups, which would mean running Parse Text 3 times per line of text. On a 16k line file that is 48k steps inside a For Each loop, for a total of 80k steps. Horrifically inefficient.

Getting a strange error with this --> JScript compilation error: Unterminated string constant

 

I wonder if this is due to the raw text containing quotation marks?
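
The quotation-mark theory is plausible. Below is a minimal sketch of the idea, written in Python rather than JScript. It assumes that PAD pastes the variable's text into the script source before running it; that is an inference from the errors in this thread, not documented behaviour. The helper `is_valid_python` and the sample strings are made up for illustration.

```python
def is_valid_python(source):
    """Return True if 'source' compiles as Python, False on a syntax error."""
    try:
        compile(source, '<pad-script>', 'exec')
        return True
    except SyntaxError:
        return False

# Script line as written in the designer, before variable substitution.
template = 'txt = "%FileContents%"'

clean = 'CS4836,,CS iKEY 1 DOOR KIT 4836,$123.00,'
# Embedded double quotes and a line break, like the real file contents.
dirty = 'ACEFOBKIT,,"CS ACE KIT, INC CONTROLLER",$123.00,\nCS4836,,CS iKEY,$123.00,'

clean_ok = is_valid_python(template.replace('%FileContents%', clean))
dirty_ok = is_valid_python(template.replace('%FileContents%', dirty))

# Passing only the file path keeps the substituted text short and predictable.
path_template = "f = r'''%CSVFile%'''"
path_ok = is_valid_python(path_template.replace('%CSVFile%', r'C:\temp\new_prices.csv'))

print(clean_ok, dirty_ok, path_ok)
```

Under that assumption, the first quote or line break inside the file contents ends the string literal early, which would explain an "Unterminated string constant" style error. Handing the script a filename and letting it read the file itself avoids the problem.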

I did not get any error after running the JavaScript, nor did it return any results.

 

Moreover, on testing the regex with the text you shared it is not returning anything 

 

https://regex101.com/

 

VJR_0-1657006474379.png

It does, you need to unescape the double-escaped slashes.

@MisterH, please refer to @mscheetham's suggestion regarding the regular expression that you found to be working in REGEX Buddy.

Ok Everyone,

 

Thanks for having a crack at this. Here's an update with a little more research and experimentation done:

  • The JavaScript step is not going to be able to run the necessary code due to how badly munted the CSV data actually is. The contents of the file are poor and no amount of cleaning will guarantee a viable result.
  • The JavaScript step will bomb-out at the point of taking the PAD variable ('FileContents') due to how malformed the data is - this is the cause of most of the errors

Switching to the Python step instead:

  • Updated the REGEX for Python 2.7 (dropped the named capture groups; Python's re module only accepts them in the (?P<name>...) form, not (?<name>...))
  • Had to Parse/Replace any and all double quote marks in the 'FileContents' with 'nothing'
  • Updated Python code runs without error, however I am struggling to now output the results (see below)
import re
txt = r"""%FileContents%"""
reg = """%RegEx%"""
p = re.compile(reg)
m=p.match(txt)
print(m)

 

I am guessing that I am going to need to do something with the matches to output them in another data type to get the results back into PAD. When I try to use the regex string as a test output I am also not getting anything back - this is a simple print statement for the variable 'reg' in the code.
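
One likely cause is that `match` only tries the pattern at the very start of the text and returns a single match object (or `None`), so `print(m)` has nothing useful to show. A minimal sketch of the alternative follows, using `finditer` with `re.MULTILINE`. The pattern is the updated one posted later in this thread with Python-style `(?P<name>...)` groups added; the group names and the `|` separator are assumptions for illustration only.

```python
import re

# Updated pattern from this thread, with (?P<name>...) groups added for illustration.
PATTERN = (r'^(?P<SKU>[\d\w-]+),[\d\w /]*,"?(?P<Description>.+[^"])(?:"+)?,'
           r'\$(?P<Price>[\d\.]+),.*$')

# Two of the sample rows posted in this thread.
txt = ('CS4836,,CS iKEY 1 DOOR KIT 4836 with 10 iKEYS,$123.00,1 Door Kits,CST,\n'
       'CS4828,,CS iKEY 2 DOOR KIT 4828 with 10 iKEYS,$123.00,2 & 4 Door Kits,CST,\n')

p = re.compile(PATTERN, re.MULTILINE)  # ^ and $ now match at every line

first = p.match(txt)                             # Tries the start of the text only
rows = [m.groupdict() for m in p.finditer(txt)]  # One dict per matching line

for row in rows:
    # Print plain strings so the calling flow receives text, not object reprs.
    print(row['SKU'] + '|' + row['Description'] + '|' + row['Price'])
```

Printing strings built from the groups, rather than the match object itself, is what lets the calling flow pick up the values as text.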

 

Any ideas?

 

MisterH

Hi VJR,

 

I am sorry that the copy / paste from Windows into this forum 'escaped' the slash characters. I did not see that when I made the original post. My apologies.

 

In the running process the regex is not 'escaped'; it is as you suggested it should be. It has since had a few minor updates to accommodate some more of the vagaries that this file keeps coming up with - an endless set of pitfalls and traps it seems.

 

^([\d\w-]+),[\d\w /]*,"?(.+[^"])(?:"+)?,\$([\d\.]+),.*$
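
To see what the update changes, here is a small comparison sketch on one of the quoted sample rows. The `OLD` pattern is a reconstruction of the one in the first post (doubled backslashes removed, named groups made positional), so treat it as an assumption; `NEW` is the pattern exactly as posted above.

```python
import re

# Reconstruction of the first post's pattern (an assumption, see the note above).
OLD = r'^([\d\w]+),\d*,"?(.+)"?,\$([\d\.]+)'
# Updated pattern as posted above.
NEW = r'^([\d\w-]+),[\d\w /]*,"?(.+[^"])(?:"+)?,\$([\d\.]+),.*$'

line = ('ACEFOBKIT,,"CS ACE WIEGAND 1 DOOR KIT, INC CS4890 CONTROLLER, '
        'HID READER & 10 X SEOS FOBS",$123.00,1 Door Kits,CST,')

old_desc = re.match(OLD, line).group(2)
new_desc = re.match(NEW, line).group(2)

print(repr(old_desc[-5:]))  # Tail of the description captured by the old pattern
print(repr(new_desc[-5:]))  # Tail of the description captured by the new pattern
```

In the old pattern the greedy `.+` swallows the closing quote because the `"?` after it is optional. Forcing the description to end on a non-quote character with `[^"]` leaves the closing quotes to the `(?:"+)?` group instead.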

 

There is no way for JS to handle the raw text of the file, so I have switched to Python to try and handle it. I am thinking it is going to have to be built solely in Python with an output being provided back to PAD.

1/ Get the filename to process

2/ Pass the filename into the Python Script step

3/ Have Python load and process the file using REGEX

4/ Have Python construct a suitable output format / variable to return to PAD

5/ Give the result to PAD and continue with the next steps

 

I'll update as I go. I hope that this might be helpful to others when I get it finished.

 

MisterH
