Hacking together an Alexa skill

24. December 2018

Alexa is an Amazon technology that allows users to build voice-driven applications. Amazon takes care of converting voice input to text and vice versa, provisioning the software to devices, and calling into your business logic. You use the Amazon interface to build your model and provide the logic that executes based on the text input. The combination of the model and the business logic is an Alexa skill. Alexa skills run on a variety of devices, most prominently the Amazon Echo.

I built an Alexa skill as a proof of concept for a hackathon; I had approximately six hours to build something to demo. My goals for the proof of concept were to:

  • Run without error in the simulator
  • Pull data from a remote service using an AWS lambda function
  • Take input from a user
  • Respond to that input

I was working with horse racing data because it is timely and because I had access to an API that provided interesting data. Horse races happen at a specific track on a specific date and time. Each race has a number of horses that compete.

The flow of my Alexa skill was:

  • Notify Alexa to call into the custom skill using a phrase.
  • Prompt the user to choose a track from one of N provided by me.
  • Store the track name in the session for future operations.
  • Prompt the user to choose between two sets of information that might be of use: the number of races today or the date of the next featured race.
  • Return the requested information.
  • Exit the skill, which meant that Alexa was no longer listening for voice input.

The barriers to entry for creating a proof of concept are low. If you can write a Python script and navigate around the AWS console, you can write an Alexa skill. However, there are some other considerations that I didn’t have to work through because this was a hackathon project. Tasks such as UX, testing, and deployment to a device would be crucial to any production project.

Jargon and Terminology

Like any other technology, Alexa has its own terminology. And there’s a lot of it.

A skill is a package of a model to convert voice to text and vice versa as well as business logic to execute against the text input. A skill is accessed by a phrase the user says, like “listen to NPR” or “talk horse racing.” This phrase is an “invocation.” The business logic is written by you, but the Alexa service handles the voice to text and text to voice conversion.

A skill is made up of one or more intents. An intent is one branch of logic and is also identified by a phrase, called an utterance. While invocations need to be globally unique, utterances only trigger after a skill is invoked, so the phrasing can overlap between different skills. An example intent utterance would be “please play ‘Fresh Air’” or “my favorite track is Arapahoe Park.” You can also use a prepackaged intent, such as one that returns search results, and tie that to an utterance. Utterances are also called samples.

Slots are placeholders within utterances. If the intent phrase is “please play ‘Fresh Air’” you can parameterize the words ‘Fresh Air’ and have that phrase converted to text and delivered to you. A slot is basically a multiple choice selection, so you can provide N values and have the text delivered to you. Each slot has a defined data type. It was unclear to me what happens when a slot is filled with a value that is not one of your N values. (I didn’t get a chance to test that.)
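
To make the slot idea concrete, here is a minimal sketch of reading a slot value out of the intent object and guarding against a value that is not one of your N values. The names `KNOWN_TRACKS`, `get_slot_value` and `resolve_track` are hypothetical, and the lowercase normalization is an assumption of this sketch, not something Alexa does for you.

```python
# Hypothetical set of the N values configured for the slot type.
KNOWN_TRACKS = {"arapahoe park", "tampa bay downs"}


def get_slot_value(intent, slot_name):
    """Return the text delivered for a slot, or None if it was not filled."""
    slot = intent.get("slots", {}).get(slot_name, {})
    return slot.get("value")


def resolve_track(intent):
    """Return a known track name, or None when the value is missing or unexpected."""
    value = get_slot_value(intent, "TrackName")
    if value is None:
        return None
    normalized = value.strip().lower()
    return normalized if normalized in KNOWN_TRACKS else None


# Example payload shaped like the "intent" object of an IntentRequest.
example_intent = {
    "name": "MyColorIsIntent",
    "slots": {"TrackName": {"name": "TrackName", "value": "Arapahoe Park"}},
}
print(resolve_track(example_intent))
```

Returning None for an unexpected value lets the calling code decide whether to reprompt the user or fall back to a default.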

A session is a place for your business logic to store temporary data between invocations of intents. A session is tied both to an application and a user (more info here). Sessions stay around for about the length of time a user is interacting with your application. Depending on application settings it will be about 30 seconds. If you need to store data for longer, connect your business logic to a durable storage solution like S3 or DynamoDB.
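
As a rough illustration of how session data makes the round trip, the sketch below returns a value in the response's `sessionAttributes` and reads it back from the `attributes` of the session on a later request. The function names and the `favoriteTrack` key are hypothetical and chosen only for this example.

```python
def remember_track(track_name):
    """Build a response that asks Alexa to keep the track in the session."""
    return {
        "version": "1.0",
        "sessionAttributes": {"favoriteTrack": track_name},
        "response": {
            "outputSpeech": {
                "type": "PlainText",
                "text": "Got it, " + track_name + ".",
            },
            # Keep the session open so the next intent can read the value.
            "shouldEndSession": False,
        },
    }


def recall_track(session):
    """Read the track back from the session object of a later request."""
    return (session.get("attributes") or {}).get("favoriteTrack")
```

If the session has ended or the user never chose a track, `recall_track` returns None, which is the cue to ask again or to look the value up in durable storage.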

Getting started

To get started, I’d suggest using this tutorial as a foundation. If you are looking for a different flow, look at this set of tutorials and see if any of them match your problem space. All of them walk you through creating an Alexa skill using a Python Lambda function. It’s worth noting that you’ll have to sign up with an Amazon account for access to the Alexa console (it’s a different authentication system than AWS IAM). I’d also start out using a Lambda function to eliminate a possible complication. If you use Lambda, you don’t have to worry about making sure Alexa can access your HTTPS endpoint (the other place your business logic can reside).

Once you have the tutorial working in the Alexa console, you can start extending the main components of the Alexa skill: the model or the business logic.

Components

You configure the model using the Alexa console and the Alexa skills kit or via the CLI or skills kit API. In either case, you’re going to end up with a JSON configuration file with information about the invocation phrase, the intents and the slots. You also can trigger a model build and test your model using a browser when using the console, as long as you have a microphone.

Here are selected portions of the JSON configuration file for the Alexa skill I created. You can see this was a proof of concept as I didn’t delete the color scheme from the tutorial and only added two tracks that the user can select as their favorite.

{  
   "interactionModel":{  
      "languageModel":{  
         "invocationName":"talk horse racing",
         "intents":[  
            {  
               "name":"MyColorIsIntent",
               "slots":[  
                  {  
                     "name":"TrackName",
                     "type":"TrackNameType"
                  }
               ],
               "samples":[  
                  "my favorite track is {TrackName}"
               ]
            },
            {  
               "name":"AMAZON.HelpIntent",
               "samples":[  

               ]
            },
            {  
               "name":"HowManyRaces",
               "slots":[  

               ],
               "samples":[  
                  "how many races"
               ]
            },
            {  
               "name":"NextStakesRace",
               "slots":[  

               ],
               "samples":[  
                  "when is the stakes race",
                  "when is the next stakes race"
               ]
            }
         ],
         "types":[  
            {  
               "name":"LIST_OF_COLORS",
               "values":[  
                  {}
               ]
            },
            {  
               "name":"TrackNameType",
               "values":[  
                  {  
                     "name":{  
                        "value":"Arapahoe Park"
                     }
                  },
                  {  
                     "name":{  
                        "value":"Tampa Bay Downs"
                     }
                  }
               ]
            }
         ]
      }
   }
}
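
Because the model is plain JSON, it can be checked with a few lines of Python before you trigger a build. The sketch below flags any `{Slot}` placeholder that appears in a sample but is not declared on its intent. The `undeclared_slots` helper is hypothetical, and the embedded model is a trimmed copy of the configuration above.

```python
import json
import re


def undeclared_slots(model):
    """Return (intent name, slot name) pairs used in samples but not declared."""
    problems = []
    for intent in model["interactionModel"]["languageModel"]["intents"]:
        declared = {slot["name"] for slot in intent.get("slots", [])}
        for sample in intent.get("samples", []):
            for used in re.findall(r"\{(\w+)\}", sample):
                if used not in declared:
                    problems.append((intent["name"], used))
    return problems


model = json.loads("""
{"interactionModel": {"languageModel": {
  "invocationName": "talk horse racing",
  "intents": [
    {"name": "MyColorIsIntent",
     "slots": [{"name": "TrackName", "type": "TrackNameType"}],
     "samples": ["my favorite track is {TrackName}"]},
    {"name": "HowManyRaces", "slots": [], "samples": ["how many races"]}
  ]}}}
""")
print(undeclared_slots(model))
```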

The other component of the system is business logic. This can either be an AWS Lambda function, written in any language supported by that service, or a service that responds to an HTTPS request. The latter could be useful for leveraging existing code or data that is not in AWS. If you use Lambda, you can deploy the skill just like any other Lambda function, which means you can leverage whatever lifecycle, frameworks or testing solution you use for other Lambda functions. Using a non-Lambda solution requires a bit more work when processing a request, but it can be done.

The business logic I wrote for this was basically hacked tutorial code. The first section is the lambda handler. Below is a relevant snippet where we examine the event passed to the lambda function by the Alexa system and call the appropriate business method.

def lambda_handler(event, context):
    if event['session']['new']:
        on_session_started({'requestId': event['request']['requestId']},
                           event['session'])

    if event['request']['type'] == "LaunchRequest":
        return on_launch(event['request'], event['session'])
    elif event['request']['type'] == "IntentRequest":
        return on_intent(event['request'], event['session'])
    elif event['request']['type'] == "SessionEndedRequest":
        return on_session_ended(event['request'], event['session'])

...

on_intent is the logic dispatcher which retrieves the intent name and then calls the appropriate internal function.

def on_intent(intent_request, session):
    """ Called when the user specifies an intent for this skill """
    print("on_intent requestId=" + intent_request['requestId'] +
          ", sessionId=" + session['sessionId'])

    intent = intent_request['intent']
    intent_name = intent_request['intent']['name']

    if intent_name == "MyColorIsIntent":
        return set_color_in_session(intent, session)
    # …
    elif intent_name == "HowManyRaces":
        return get_how_many_races(intent, session)

…
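
As the number of intents grows, the if/elif chain can get long. One alternative, sketched here as a design option and not as what the tutorial code does, is a dictionary that maps intent names to handler functions. `dispatch_intent` and the `handlers` mapping are hypothetical names.

```python
def dispatch_intent(intent_request, session, handlers):
    """Look up the handler for the intent name and call it."""
    intent = intent_request["intent"]
    handler = handlers.get(intent["name"])
    if handler is None:
        raise ValueError("Invalid intent: " + intent["name"])
    return handler(intent, session)


# Usage with a stand-in handler; real handlers would build Alexa responses.
def fake_how_many_races(intent, session):
    return "handled " + intent["name"]


handlers = {"HowManyRaces": fake_how_many_races}
print(dispatch_intent({"intent": {"name": "HowManyRaces"}}, {}, handlers))
```

Adding an intent then means adding one entry to the mapping, and each handler can be tested on its own.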

Each business logic function can be independent and could call into different services if need be.

def get_how_many_races(intent, session):
    session_attributes = {}
    # Setting reprompt_text to None signifies that we do not want to reprompt
    # the user. If the user does not respond or says something that is not
    # understood, the session will end.
    reprompt_text = None

    attributes = session.get('attributes') or {}
    if "favoriteColor" in attributes:
        # The tutorial's "favoriteColor" key holds the track name here.
        favorite_track = attributes['favoriteColor']
        # str() guards against get_number_races returning an int.
        speech_output = "There are " + str(get_number_races(favorite_track)) + \
                        " races at " + favorite_track + \
                        " today. Thank you, good bye."
        should_end_session = True
    else:
        speech_output = "Please tell me your favorite track by saying, " \
                        "my favorite track is Arapahoe Park"
        should_end_session = False

    return build_response(session_attributes, build_speechlet_response(
        intent['name'], speech_output, reprompt_text, should_end_session))

build_response is directly from the sample code and creates a correctly formatted response: a dictionary that Lambda serializes to JSON. This response will be interpreted by Alexa and converted into speech.

def build_response(session_attributes, speechlet_response):
    return {
        'version': '1.0',
        'sessionAttributes': session_attributes,
        'response': speechlet_response
    }
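
The `build_speechlet_response` helper called earlier is not shown in this post. A minimal sketch of what such a helper could look like follows, assuming the standard Alexa response fields (`outputSpeech`, `card`, `reprompt`, `shouldEndSession`). Leaving out the reprompt when `reprompt_text` is None is a choice made for this sketch and may differ from the tutorial code.

```python
def build_speechlet_response(title, output, reprompt_text, should_end_session):
    """Build the 'response' part of an Alexa reply using plain-text speech."""
    response = {
        "outputSpeech": {"type": "PlainText", "text": output},
        # The card is what shows up in the Alexa companion app.
        "card": {"type": "Simple", "title": title, "content": output},
        "shouldEndSession": should_end_session,
    }
    if reprompt_text is not None:
        response["reprompt"] = {
            "outputSpeech": {"type": "PlainText", "text": reprompt_text}
        }
    return response
```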

Based on the firm foundation of the tutorial, you can easily add more slots and intents and change the invocation. You can also build out additional business logic that responds to the additional voice input.

Testing

I tested my skill manually using the built-in simulator in the Alexa console. I tried other simulators, but they were not effective. At the bottom of the python tutorial mentioned above, there is a reference to echosim.io, which is an online Alexa skill simulator; I couldn’t seem to make it work.

After each model change (new or modified utterances, intents or invocations) you will need to rebuild the model (approximately 30-90 seconds, depending on the complexity of your model). Changing the business logic does not require rebuilding the model, and you can use that functionality to iterate more quickly.

I did not investigate automated testing. If I were building a production Alexa skill, I’d add a small layer of indirection so that the business logic could be easily unit tested, apart from any dependencies on Alexa objects. I’d also plan to build a CI/CD pipeline so that changes to the model or the Lambda function are tested and deployed automatically, something like what is outlined here.
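
One way to picture that layer of indirection: keep the wording and arithmetic in a pure function, and let a thin adapter pull values out of the Alexa session. This is only a sketch. `races_speech`, `handle_how_many_races` and the `count_lookup` parameter are hypothetical names, and the lookup is injected so a test can replace the remote API call.

```python
def races_speech(track_name, race_count):
    """Pure function: no Alexa or AWS objects, so it is trivial to test."""
    verb = "is" if race_count == 1 else "are"
    noun = "race" if race_count == 1 else "races"
    return "There {} {} {} at {} today.".format(verb, race_count, noun, track_name)


def handle_how_many_races(session, count_lookup):
    """Thin adapter: reads the track from the session, then calls the pure logic."""
    track = (session.get("attributes") or {}).get("favoriteColor")
    if track is None:
        return "Please tell me your favorite track."
    return races_speech(track, count_lookup(track))


def test_races_speech():
    """Written in pytest style; it can also be called directly."""
    assert races_speech("Arapahoe Park", 9) == "There are 9 races at Arapahoe Park today."
    assert races_speech("Arapahoe Park", 1) == "There is 1 race at Arapahoe Park today."
```

The adapter can be tested with a plain dictionary for the session and a lambda for the lookup, with no simulator involved.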

User Experience (UX)

Voice UX is very different from the UX of desktop or a mobile device. Because information transmission is slow, it’s even more important to think about voice UX for an Alexa skill than it would be if you were building a more traditional web-based app. If you are building a skill for any other purpose than exploration or proof of concept, make sure to devote some time to learning about voice UX. This webinar appears useful.

Some lessons I learned:

  • Don’t go too deep in navigation level. With Alexa, you can provide choice after choice for the user, but remember the last time you dealt with an interactive phone voice recognition system. Did you like it? Keep interactions short.
  • Repeat back what Alexa “heard” as this gives the user another chance to correct course.
  • Offer a help option. If I were building a production app, I’d want to get some kind of statistics on how often the help option was invoked to see if there was an oversight on my part.
  • Think about error handling using reprompts. If the skill hasn’t received input, it can reprompt and possibly get more user input.
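
Two of these lessons, repeating back what was heard and using a reprompt, can be sketched in a single response. The wording and the `confirm_track_response` name are hypothetical. The `favoriteColor` key matches the session key used in the code earlier in this post.

```python
def confirm_track_response(heard_track):
    """Echo the heard value, then offer the two choices with a reprompt."""
    speech = ("I heard {}. Do you want the number of races today, "
              "or the next stakes race?").format(heard_track)
    reprompt = "You can say, how many races, or, when is the next stakes race."
    return {
        "version": "1.0",
        "sessionAttributes": {"favoriteColor": heard_track},
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            # Spoken only if the user stays silent or is not understood.
            "reprompt": {
                "outputSpeech": {"type": "PlainText", "text": reprompt}
            },
            "shouldEndSession": False,
        },
    }
```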

After the simulator

A lot of testing and development can take place in the Amazon Alexa simulator. However, at some point, you’ll need to deploy to a device. Since this was a proof of concept, I didn’t do that, but there is documentation on that process here.

Conclusion

This custom Alexa skill was the result of a six-hour exploration during a company hackfest. At the end of the day, I had a demo I could run on the Alexa Simulator. Alexa is mainstream enough that it makes sense for anyone who works with timely, textual information to evaluate building a skill, especially since a prototype can be built relatively quickly. For instance, it seems to me that a newspaper should have an Alexa skill, but it doesn’t make as much sense for an e-commerce store (unless you have certain timely information and a broad audience) because complex navigation is problematic. Given the low barrier to entry, Alexa skills are worth exploring as this new UX for interacting with computers becomes more prevalent.

About the Author

Dan Moore is director of engineering at Culture Foundry. He is a developer with two decades of experience, former AWS trainer, and author of “Introduction to Amazon Machine Learning,” a video course from O’Reilly. He blogs at http://www.mooreds.com/wordpress/ . You can find him on Twitter at @mooreds.

About the Editors

Ed Anderson is the SRE Manager at RealSelf, organizer of ServerlessDays Seattle, and occasional public speaker. Find him on Twitter at @edyesed.

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Alexa is checking your list

20. December 2016

Author: Matthew Williams
Editors: Benjamin Marsteau, Scott Francis

Recently I made a kitchen upgrade: I bought an Amazon Dot. Alexa, the voice assistant inside the intelligent puck, now plays a key role in the preparation of meals every day. With both hands full, I can say “Alexa, start a 40-minute timer” and not have to worry about burning the casserole. However, there is a bigger problem coming up that I feel it might also help me with. It is the gift-giving season, and I have been known to get the wrong things. Wouldn’t it be great if I could have Alexa remind me what I need to get for each person on my list? Well, that simple idea took me down a path that has consumed me for a little too long. And since I built it, I figured I would share it with you.

Architecting a Solution

Now it is important to remember that I am a technologist and therefore I am going to go way beyond what’s necessary. [ “anything worth doing is worth overdoing.” — anon. ] Rather than just building the Alexa side of things, I decided to create the entire ecosystem. My wife and I are the first in our families to add Alexa to our household, so that means I need a website for my friends and family to add what they want. And of course, that website needs to talk to a backend server with a REST API to collect the lists into a database. And then Alexa needs to use that same API to read off my lists.

OK, so spin up an EC2 instance and build away, right? I did say I am a technologist, right? That means I have to use the shiniest tools to get the job done. Otherwise, it would just be too easy.

My plan is to use a combination of AWS Lambda to serve the logic of the application, the API Gateway to host the REST endpoints, DynamoDB for saving the data, and another Lambda to respond to Alexa’s queries.

The Plan of Attack

Based on my needs, I think I came up with the ideal plan of attack. I would tackle the problems in the following order:

  1. Build the Backend – The backend includes the logic, API, and database.
    1. Build a Database to Store the Items
    2. Lambda Function to Add an Item
    3. Lambda Function to Delete an Item
    4. Lambda Function to List All Items
    5. Configure the API Gateway
  2. Build the User Interface – The frontend can be simple: show a list, and let folks add and remove from that list.
  3. Get Alexa Talking to the Service – That is why we are here, right?

There are some technologies used that you should understand before beginning. You do not have to know everything about Lambda or the API Gateway or DynamoDB, but let’s go over a few of the essentials.

Lambda Essentials

The purpose of Lambda is to run the functions you write. Configuration is pretty minimal, and you only get charged for the time your functions run (you get a lot of free time). You can do everything from the web console, but after setting up a few functions, you will want another way. See this page for more about AWS Lambda.

API Gateway Essentials

The API Gateway is a service to make it easier to maintain and secure your APIs. Even if I get super popular, I probably won’t get charged much here as it is $3.50 per million API calls. See this page for more about the Amazon API Gateway.

DynamoDB Essentials

DynamoDB is a simple (and super fast) NoSQL database. My application has simple needs, and I am going to need a lot more friends before I reach the 25 GB and 200 million requests per month that are on the free plan. See this page for more about Amazon DynamoDB.

Serverless Framework

Sure I can go to each service’s console page and configure them, but I find it a lot easier to have it automated and in source control. There are many choices in this category including the Serverless framework, Apex, Node Lambda, and many others. They all share similar features so you should review them to see which fits your needs best. I used the Serverless framework for my implementation.

Alexa Skills

When you get your Amazon Echo or Dot home, you interact with Alexa, the voice assistant. The things that she does are Alexa Skills. To build a skill you need to define a list of phrases to recognize, what actions they correspond to, and write the code that performs those actions.

Let’s Start Building

There are three main components that need to be built here: API, Web, and Skill. I chose a different workflow for each of them. The API uses the Serverless framework to define the CloudFormation template, Lambda Functions, IAM Roles, and API Gateway configuration. The Webpage uses a Gulp workflow to compile and preview the site. And the Alexa skill uses a Yeoman generator. Each workflow has its benefits and it was exciting to use each.

If you would like to follow along, you can clone the GitHub repo: https://github.com/DataDog/AWS-Advent-Alexa-Skill-on-Lambda.

Building the Server

The process I went through was:

  1. Install Serverless Framework (npm i -g serverless)
  2. Create the first function (sls create -n <service name> -t aws-nodejs). The top-level concept in Serverless is that of a service. You create a service, then all the Lambda functions, CloudFormation templates, and IAM roles defined in the serverless.yaml file support that service.
  3. Add the resources needed to a CloudFormation template in the serverless.yaml file. For example:
    alexa_1
    Refer to the CloudFormation docs and the Serverless Resources docs for more about this section.
  4. Add the IAM Role statements to allow your Lambda access to everything needed. For example:
    alexa_2
  5. Add the Lambda functions you want to use in this service. For example:
    alexa_3
    The events section lists the triggers that can kick off this function. **http** means to use the API Gateway. I spent a little time in the API Gateway console and got confused. But these four lines in the serverless.yaml file were all I needed.
  6. Install serverless-webpack npm and add it to the YAML file:
    alexa_4
    This configuration tells Serverless to use webpack to bundle all your npm modules together in the right way. And if you want to use ECMAScript 2015, this will run Babel to convert back down to a JavaScript version that Lambda can use. You will have to set up your webpack.config.js and .babelrc files to get everything working.
  7. Write the functions. For the function I mentioned earlier, I added the following to my items.js file:
    alexa_5
    This function sets the table name in my DynamoDB and then grabs all the rows. No matter what the result is, a response is formatted using this createResponse function:
    alexa_6
    Notice the header. Without this, Cross-Origin Resource Sharing (CORS) will not work. You will get nothing but 502 errors when you try to consume the API.
  8. Deploy the Service:

    Now I use 99Design’s aws-vault to store my AWS access keys rather than adding them to a rc file that could accidentally find its way up to GitHub. So the command I use is:

    If everything works, it creates the DynamoDB table, configures the API Gateway APIs, and sets up the Lambdas. All I have to do is try them out from a new application or using a tool like Paw or Postman. Then rinse and repeat until everything works.
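
The functions in this post are written in JavaScript. Purely as an illustration of the same idea in Python, here is a hedged sketch of a "list all items" handler and a response helper that sets the CORS header. The table name `gifts`, the function names, and the injectable `table` parameter are assumptions of this sketch and are not taken from the repo.

```python
import json

TABLE_NAME = "gifts"  # hypothetical table name


def create_response(status_code, body):
    """Format an API Gateway proxy response, including the CORS header."""
    return {
        "statusCode": status_code,
        "headers": {"Access-Control-Allow-Origin": "*"},
        # default=str covers the Decimal values boto3 returns for numbers.
        "body": json.dumps(body, default=str),
    }


def list_items(event, context, table=None):
    """Return every row in the table; `table` can be injected for local tests."""
    if table is None:
        import boto3  # imported lazily so the sketch runs without AWS access
        table = boto3.resource("dynamodb").Table(TABLE_NAME)
    try:
        # A single scan call returns at most 1 MB of data; a large table
        # would need to follow LastEvaluatedKey to fetch further pages.
        result = table.scan()
        return create_response(200, result.get("Items", []))
    except Exception as error:
        return create_response(500, {"message": str(error)})
```

Passing a stand-in object with a `scan` method as `table` exercises the handler without deploying anything.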

Building the Frontend

alexa_7

Remember, I am a technologist, not an artist. It works, but I will not be winning any design awards. It is a webpage with a simple table on it and loads up some Javascript to show my DynamoDB table:

alexa_8

Have I raised the technologist card enough times yet? Well, because of that I need to keep to the new stuff even with the Javascript features I am using. That means I am writing the code in ECMAScript 2015, so I need to use Babel to convert it to something usable in most browsers. I used Gulp for this stage to keep building the files and then reloading my browser with each change.

Building the Alexa Skill

Now that we have everything else working, it is time to build the Alexa Skill. Again, Amazon has a console for this which I used for the initial configuration on the Lambda that backs the skill. But then I switched over to using Matt Kruse’s Alexa App framework. What I found especially cool about his framework was that it works with his alexa-app-server so I can test out the skill locally without having to deploy to Amazon.

For this one I went back to the pre-ECMAScript 2015 syntax but I hope that doesn’t mean I lose technologist status in your eyes.

Here is a quick look at a simple Alexa response to read out the gift list:

alexa_9
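
The skill itself uses the alexa-app framework in JavaScript. As a language-neutral illustration of what reading out a list involves, here is a Python sketch that turns list items into one sentence and wraps it in an Alexa response. The item fields `name` and `gift` and both function names are hypothetical.

```python
def gift_list_speech(items):
    """Turn a list of {'name': ..., 'gift': ...} items into one sentence."""
    if not items:
        return "Your gift list is empty."
    phrases = ["{} for {}".format(item["gift"], item["name"]) for item in items]
    if len(phrases) == 1:
        return "You need to get " + phrases[0] + "."
    return "You need to get " + ", ".join(phrases[:-1]) + ", and " + phrases[-1] + "."


def gift_list_response(items):
    """Wrap the sentence in a plain-text Alexa response that ends the session."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": gift_list_speech(items)},
            "shouldEndSession": True,
        },
    }
```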

Summary

And now we have an end to end solution around working with your gift lists. We built the beginnings of an API to work with gift lists. Then we added a web frontend to allow users to add to the list. And then we added an Alexa skill to read the list while both hands are on a hot pan. Is this overkill? Maybe. Could I have stuck with a pen and scrap of paper? Well, I guess one could do that. But what kind of technologist would I be then?

About the Author

Matt Williams is the DevOps Evangelist at Datadog. He is passionate about the power of monitoring and metrics to make large-scale systems stable and manageable. So he tours the country speaking and writing about monitoring with Datadog. When he’s not on the road, he’s coding. You can find Matt on Twitter at @Technovangelist.

About the Editors

Benjamin Marsteau is a System administrator | Ops | Dad | and tries to give back to the community as much as it gives him.