How to convert Audio from Microphone Control of Po...

yashkamdar · ‎02-25-2020

Recently I wrote a blog here on how to convert audio recorded with the Microphone control of Power Apps to text, using a Power Automate solution that consumes Azure Cognitive Services.

Because the solution involves several components (Power Apps, Power Automate, an Azure Function, FFmpeg and Azure Cognitive Services), I divided the blog into 3 parts.

Design a Canvas App with The Microphone Control to capture Audio.
Create an Azure Function to convert audio captured in Power Apps from WEBM to WAV format using FFmpe...
Create a Power Automate (Flow) to create an HTML file, using the text obtained from the output of th...

The main focus of this blog, however, is how to design and configure a Power Automate solution that converts speech (the audio captured by the Microphone control) to text, with additional capabilities such as creating an HTML file from the converted text.

In addition, we are going to add more power to the Power Automate solution by also -

Creating a speech file in SharePoint that contains the audio recorded with the Microphone control.
Creating an HTML file with the text obtained from the output of Cognitive Services after conversion.
Converting the HTML file to PDF (the most widely used document format for business processes).

Issue-

Whenever we record audio with the Microphone control of Power Apps, it is always recorded in the WEBM format.
When we pass this WEBM audio to Azure Cognitive Services for speech-to-text conversion, we get an "Unsupported File Format" error.
This is because the Azure Cognitive Services speech-to-text API only recognizes audio in WAV or OGG format. WEBM is not a supported format.

Solution-

We’re going to use FFmpeg to convert the Microphone audio from WEBM to a WAV file, so that we can pass that file to the Azure Speech to Text Cognitive Service.
Simply put, we’re going to use an Azure Function to build a simple API that converts a WEBM file to a WAV file for us. This API uses FFmpeg to do the actual conversion.
FFmpeg is a command-line tool for converting between audio and video formats.
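To make the conversion step concrete, here is a minimal Python sketch of how an FFmpeg call for WEBM to WAV could look. It is not the Azure Function from the other part of this series; the function names, the 16 kHz sample rate and the mono channel setting are assumptions chosen for illustration, and it expects an `ffmpeg` executable to be available on the PATH.

```python
import subprocess


def build_ffmpeg_command(src_path: str, dst_path: str) -> list[str]:
    """Return an ffmpeg argument list that converts src_path to a PCM WAV file."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it already exists
        "-i", src_path,       # input file (WEBM from the Microphone control)
        "-ar", "16000",       # assumed sample rate: 16 kHz
        "-ac", "1",           # assumed channel count: mono
        "-c:a", "pcm_s16le",  # 16-bit PCM audio, the usual WAV encoding
        dst_path,
    ]


def convert_webm_to_wav(src_path: str, dst_path: str) -> None:
    """Run ffmpeg; raises subprocess.CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(src_path, dst_path), check=True)
```

Calling `convert_webm_to_wav("recording.webm", "recording.wav")` would write the WAV file next to the input, with both file names being placeholders.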

Prerequisites-

Before you begin, please make sure the following prerequisites are in place:

An Office 365 subscription with access to Power Apps and Power Automate (Flow).
An Azure subscription to create a Function App.
A Muhimbi PDF Converter Services Online full, free or trial subscription (Start trial).
A Power Automate plan that allows the use of the HTTP actions.
Appropriate privileges to create Power Apps and Power Automate (Flow).
Working knowledge of both Power Apps and Power Automate (Flow).

Now that we have all the prerequisites ready, let's start designing our Power Automate (Flow).

Step 1 - Trigger

Create a new flow and select Power Apps as the trigger.

Step 2 - Add a compose action

Add a Compose action and, in the "Inputs" field, select "Ask in PowerApps" from Dynamic content.

Step 3 - Add a Parse JSON action

Add a Parse JSON action and in the "Content" field add from Dynamic content "Outputs" of the Compose action created above.
Click on "Generate from Sample" and add the following piece of code to generate the schema.

{
    "type": "object",
    "properties": {
        "Url": {
            "type": "string"
        }
    }
}

Step 4 - Add Compose action

Add another Compose action.
In the Expression tab, select the function "dataUriToBinary".
Keeping this value intact, switch to Dynamic content and select the "Url" property.
The expression editor may now show "dataUriToBinarybody('Parse_JSON')?['properties']?['Url']", which is not valid.
Make sure the whole body(...) reference is wrapped in the round brackets of dataUriToBinary so that the expression is syntactically correct, as follows-

dataUriToBinary(body('Parse_JSON')?['properties']?['Url'])

Click OK to save the expression in the Compose action.
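For readers who want to see what this step does to the data, the following Python sketch decodes a base64 data URI into raw bytes. It only illustrates the idea behind dataUriToBinary and is not how Power Automate implements it; the function name `data_uri_to_binary` is made up for this example.

```python
import base64


def data_uri_to_binary(data_uri: str) -> bytes:
    """Decode a base64 data URI such as 'data:audio/webm;base64,AAAA' into bytes."""
    header, sep, payload = data_uri.partition(",")
    if not sep or not header.startswith("data:"):
        raise ValueError("not a data URI")
    if not header.endswith(";base64"):
        raise ValueError("only base64-encoded data URIs are handled in this sketch")
    return base64.b64decode(payload)
```

The part before the comma describes the content type and encoding, and everything after it is the encoded audio itself, which is why the output can be written to a file as-is.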

Step 5 - Create file in SharePoint action

Add a "Create file in SharePoint" action.
Select the "Site Address" and "Folder Path" where you intend to save the audio file.
Give the file a meaningful name and do not forget to give it the ".wav" extension.
In the "File content" field, pass the "Outputs" of the "Compose2" action configured above.

Important note-

Your speech file, which holds the audio recorded in Power Apps, is now successfully created in SharePoint.
The next steps call the API that we configured using the Azure Function, which converts the WEBM audio to WAV format.
For details on how to configure the Azure Function, check out the blog here.

Step 6 - HTTP Post action

Add an HTTP request action with the method set to "POST".
The URI is the function URL, which you can get from the Azure portal where you configured your Azure Function.
In the "Body" field, add the same "dataUriToBinary" expression that we entered in the "Compose2" action.

Step 7 - HTTP Post action

Add an HTTP request action with the method set to "POST".
Use the following URI, changing only the region to match your service - https://<region in which cognitive service is hosted>.stt.speech.microsoft.com/sp.....
In my case the region was "WestEurope".
Pass the "Ocp-Apim-Subscription-Key" header, which you get when you create a Speech service in the Azure portal, and set "Content-type" to "audio/wav".
In the "Body" field, pass the response "Body" from the Dynamic content of the HTTP request configured in the previous step.
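The same request can be sketched outside of Power Automate, which is handy for testing the subscription key and endpoint on their own. The Python sketch below only builds the request; the function name and the `endpoint` parameter are assumptions, and the full endpoint URI has to come from your own Speech resource because it is not spelled out in full here.

```python
import urllib.request


def build_speech_request(
    endpoint: str, subscription_key: str, wav_bytes: bytes
) -> urllib.request.Request:
    """Build a POST request carrying WAV audio and the two headers used in the flow."""
    return urllib.request.Request(
        endpoint,
        data=wav_bytes,  # the WAV audio returned by the conversion API
        headers={
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "audio/wav",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the response body is the JSON that the next step parses.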

Step 8 - Add a Parse JSON action

Next we need to parse the response obtained from the Cognitive Services API in order to extract the text.
Select "Body" from the dynamic content and add it to the "Content" field.
For generating the schema, please use the payload shown below:

{
    "type": "object",
    "properties": {
        "NBest": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "Confidence": {
                        "type": "number"
                    },
                    "Lexical": {
                        "type": "string"
                    },
                    "ITN": {
                        "type": "string"
                    },
                    "MaskedITN": {
                        "type": "string"
                    },
                    "Display": {
                        "type": "string"
                    }
                },
                "required": [
                    "Confidence",
                    "Lexical",
                    "ITN",
                    "MaskedITN",
                    "Display"
                ]
            }
        }
    }
}
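To show how the parsed response is used, here is a small Python sketch that reads a response shaped like this schema and returns the Lexical text. Note that it picks the entry with the highest Confidence, which is a choice made for this example, whereas the flow simply loops over the NBest items. The sample values are invented.

```python
import json


def best_lexical(response_body: str) -> str:
    """Return the Lexical text of the highest-confidence NBest entry."""
    result = json.loads(response_body)
    candidates = result.get("NBest", [])
    if not candidates:
        raise ValueError("response contains no NBest entries")
    best = max(candidates, key=lambda item: item["Confidence"])
    return best["Lexical"]


# Hypothetical sample shaped like the schema above; the values are invented.
sample = json.dumps({
    "NBest": [
        {"Confidence": 0.62, "Lexical": "hello word", "ITN": "hello word",
         "MaskedITN": "hello word", "Display": "Hello word."},
        {"Confidence": 0.91, "Lexical": "hello world", "ITN": "hello world",
         "MaskedITN": "hello world", "Display": "Hello world."},
    ]
})
print(best_lexical(sample))  # prints "hello world"
```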

Step 9 - Create PDF using the text

Before we add a ‘Convert HTML to PDF’ action to turn the extracted text into a PDF file using the Muhimbi Converter, I want to draw your attention to the output of the Cognitive Services API action.
The Lexical parameter holds our speech-to-text output, so we need to pass the Lexical parameter as the source for generating the PDF file.
Go ahead and add the ‘Convert HTML to PDF’ action to the flow. There is no need to add an ‘Apply to each’ action yourself, as it will be added automatically.
The ‘Apply to each’ action gets added because Lexical sits inside the NBest array of the response, where each item holds the recognized text in several forms such as Lexical, ITN, MaskedITN and Display.
We need the Lexical parameter, so we select Lexical from the dynamic content as the source of the action.
If you have doubts about how the Muhimbi Converter’s ‘Convert HTML to PDF’ action works, please check here.
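Since the converter works on HTML, it can help to see what a minimal HTML wrapper around the recognized text might look like. The Python sketch below is only an illustration under my own assumptions: the function name, the default title and the page layout are all made up, and the flow itself does not require this exact markup.

```python
import html


def text_to_html(text: str, title: str = "Transcript") -> str:
    """Wrap recognized text in a minimal HTML page, escaping special characters."""
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body><p>{html.escape(text)}</p></body></html>"
    )
```

Escaping matters here because recognized speech may contain characters such as "<" or "&" that would otherwise be read as markup.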