Using Cloud-Based AI Keyword Analysis for Text Optimization
Recently I’ve seen a lot of discussion and debate about Applicant Tracking Systems and how they parse resumes for keywords to determine whether a candidate is a good fit for a job posting. Regardless of one’s thoughts on these systems, for better or for worse, the reality is that they are widely used today. I was curious whether public cloud-based artificial intelligence services could improve your odds of matching the keyphrases an employer or recruiter might use for jobs you’d be interested in applying for. To put each company’s engine to the test and get a better idea of the strengths and weaknesses of this technology, I also ran a simple tweet, a newspaper editorial, and the bulk of a poem through each. I was somewhat surprised at the outcomes.
I’ll go into specifics in the results section of each service, but all in all I was most pleased with the results from IBM Watson. Surprisingly, Azure came in second best for me, with Google and AWS falling short in my opinion.
You may also notice some similarity with a previous post, Detecting Sentiment and Emotion in Text, but that’s unavoidable as the keyword or keyphrase functionality is tied in with all the related natural language processing services. However, these functions serve very different purposes: keyword extraction tells you what the text is about rather than how the text may make one feel. Some of the services will return sentiment along with their keywords, as I’ll note where appropriate.
Below are the 4 samples of text I used to test each of the platforms’ artificial intelligence offerings:
Sample #1 is from my LinkedIn bio:
I am a well-rounded, cloud focused, IT professional. I’m comfortable in any role from contributor to manager. I’m willing to grow within an organization to achieve the goal of returning to a manager/director role in a cloud-focused company.
My roles and interests encompass a wide range of responsibilities that provide me with expertise in many areas including cloud solutions offered by Amazon Web Services and Microsoft Azure, security, domains and DNS, web technologies (servers and development), and legal compliance. My experience also includes responsibilities for developing code to interact with cloud platform APIs, Iaas, DaaS, HPE Helion, CloudSystem 9 and 10, Scality, Stratoscale Symphony, Google Analytics, Google Adwords, etc.
Throughout my 10 years of leadership positions some of the numerous initiatives I have been responsible for include:
• Collecting and correlating data to design an anti-fraud solution that decreased fraud by 86%
• Reorganized and led a team of 45 agents increasing service levels from 45% to 90% in less than 6 weeks
• The design & implementation of a global load balancing solution that improved page load times by up to 500% depending on global location
Program Management and technical project management has allowed me to work with world-class customers delivering complex multi-million-dollar solutions on a timely basis. In doing so, I work with numerous vendors and inside team members from executive management to professional services, engineers, and developers.
Sample #2 is roughly the first 5,000 characters of Edgar Allan Poe’s “The Raven”.
Sample #3 is the text of an editorial on cryptocurrencies by The Guardian.
Sample #4 is this tweet about a cute dog:
Meet Yogi. He made his first trip to the beach today. Could never have guessed eating sand would be so exhausting. 12/10 would bring back soon pic.twitter.com/8dnaCObcZ4
— WeRateDogs™ (@dog_rates) February 19, 2018
Let’s pull back the curtain and see what each service reports for our examples.
AWS Comprehend
Features
AWS offers the following features as part of its Comprehend natural language processing service. In addition to the overall sentiment detected, the Sentiment Analysis function will give you scores for each possible value to show you how certain it is of its decision, out to 14 decimal places. I’ll be rounding that up a bit for the sake of your eyes. For this post, I only experimented with the Keyphrase Extraction function.
Keyphrase Extraction: The Keyphrase Extraction API returns the key phrases or talking points and a confidence score to support that this is a key phrase.
Sentiment Analysis: The Sentiment Analysis API returns the overall sentiment of a text (Positive, Negative, Neutral, or Mixed).
Entity Recognition: The Entity Recognition API returns the named entities (“People,” “Places,” “Locations,” etc.) that are automatically categorized based on the provided text.
Language Detection: The Language Detection API automatically identifies text written in over 100 languages and returns the dominant language with a confidence score to support that a language is dominant.
Topic Modeling: Topic Modeling identifies relevant terms or topics from a collection of documents stored in Amazon S3. It will identify the most common topics in the collection and organize them in groups and then map which documents belong to which topic.
Pricing
Natural Language Processing requests are measured in units of 100 characters, with a 3 unit (300 characters) minimum charge per request. The free tier gives you access to 50,000 units of text (about 5 million characters) for each of the APIs per month with each additional unit starting at $0.0001 with significant cost reductions based on volume. Topic Modeling is the exception as it is priced per job.
I like Amazon’s price structure as it gives you quite a lot of wiggle room to play and remains inexpensive to continue using. Complete pricing can be found via the appropriate link in the Resources box above.
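As a rough sketch of the unit arithmetic described above, the snippet below estimates how many billable units a request would consume. The function names are hypothetical (they are not part of the AWS SDK), and it assumes that partial units round up and that the free tier is ignored, so treat the output as a ballpark figure only.

```php
<?php
// Hypothetical helper, not part of the AWS SDK: estimates billable Comprehend units.
// Assumption: a partial unit is rounded up to a whole unit.
function estimateComprehendUnits(int $characters): int
{
    $units = (int) ceil($characters / 100); // 1 unit = 100 characters
    return max(3, $units);                  // 3-unit minimum charge per request
}

// Hypothetical helper: cost at the starting rate quoted above, ignoring the
// free tier and any volume discounts.
function estimateComprehendCost(int $characters, float $pricePerUnit = 0.0001): float
{
    return estimateComprehendUnits($characters) * $pricePerUnit;
}

echo estimateComprehendUnits(120) . PHP_EOL;  // 3 (the minimum charge applies)
echo estimateComprehendUnits(5000) . PHP_EOL; // 50
```

A short tweet is therefore billed the same as a 300-character request, which is worth knowing if you plan to analyze many small texts.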
Limits
For my tests, only the first limit really applied. It was sufficient for my needs and seems large enough to accommodate most tasks you’d need it for. I included all limits, as of writing this post, for reference:
- The maximum document size is 5,000 bytes of UTF-8 encoded characters.
- The maximum number of documents for the BatchDetectDominantLanguage, BatchDetectEntities, BatchDetectKeyPhrases, and BatchDetectSentiment operations is 25 documents per request.
- The BatchDetectDominantLanguage and DetectDominantLanguage operations have the following limitations:
  - They don’t support phonetic language detection. For example, they will not detect “arigato” as Japanese, nor “nihao” as Chinese.
  - They may have trouble distinguishing close language pairs, such as Indonesian and Malay; or Bosnian, Croatian, and Serbian.
  - For best results, the input text should be at least 20 characters long.
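Because the document limit is expressed in bytes rather than characters, text with accented or non-Latin characters reaches it sooner than a character count suggests. The sketch below shows one way to trim a string to the limit without cutting a multi-byte character in half. The helper name `truncateUtf8Bytes` is hypothetical, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: trims a UTF-8 string to at most $maxBytes bytes
// without splitting a multi-byte character. Assumes valid UTF-8 input.
function truncateUtf8Bytes(string $text, int $maxBytes = 5000): string
{
    if (strlen($text) <= $maxBytes) { // strlen() counts bytes, not characters
        return $text;
    }
    $cut = $maxBytes;
    // If the first dropped byte is a continuation byte (binary 10xxxxxx),
    // the cut falls inside a character, so back up to that character's start.
    while ($cut > 0 && (ord($text[$cut]) & 0xC0) === 0x80) {
        $cut--;
    }
    return substr($text, 0, $cut);
}

echo strlen(truncateUtf8Bytes(str_repeat('a', 6000))) . PHP_EOL; // 5000
```

Truncation simply discards the tail of the text; splitting a long document into several requests would be the alternative if the whole text matters.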
API
With Amazon, you really have to install and use either the CLI or language-specific SDK. I used the PHP SDK and found it really easy to install and quick to code against – not to mention clean. For the authentication in the SDK to be successful please be sure your system’s clock/time is set correctly.
```php
<?php
// Set up and install the SDK first! For PHP you do that with Composer
require 'vendor/autoload.php';

// Provide your AWS API keys; the user for these keys needs to be given permissions to Comprehend
$aws_access_key = '';
$aws_secret_access_key = '';

// The text you want to analyze
$text = '';

// Do the work and get your result
$aws = new Aws\Sdk([
    'version' => 'latest',
    'region' => 'us-east-1',
    'credentials' => [
        'key' => $aws_access_key,
        'secret' => $aws_secret_access_key,
    ],
    'Comprehend' => [
        'region' => 'us-east-1'
    ],
]);
$comprehend = $aws->createComprehend();
$aws_result = $comprehend->detectKeyPhrases([
    'LanguageCode' => 'en',
    'Text' => $text
]);

// $aws_result will be an array you can access for the returned data
echo 'AWS Results' . PHP_EOL;
echo 'Keyphrases: ' . PHP_EOL;
foreach ( $aws_result['KeyPhrases'] as $i ) {
    echo $i['Text'] . ' (' . $i['Score'] . ')' . PHP_EOL;
}

// Alternately, if you'd like to see the full set of data returned
print_r($aws_result);
?>
```
Results
AWS seems to just run through each sentence and pick out every keyword it comes across, which results in a lot of keyword duplication. That’s fine if you are looking at relevance within a sentence or a small block of text, but I personally would be looking at the document as a whole rather than in chunks. I also noticed that Comprehend tends to be very loose on keyphrase detection (“Amazon Web Services and Microsoft Azure”) and tends to include some punctuation in keywords (see sample #2). Some words or phrases that I would have picked out either were not returned as important or had low confidence scores.
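If the duplication gets in the way, one option is to collapse repeated phrases after the fact. The sketch below keeps the highest-scoring occurrence of each phrase and is only one possible approach; the helper name `dedupeKeyPhrases` is hypothetical, and it assumes an input shaped like the `KeyPhrases` array used in the loop above (entries with `Text` and `Score`). The sample values are duplicates taken from the results table further down.

```php
<?php
// Hypothetical post-processing helper; $keyPhrases mirrors $aws_result['KeyPhrases'].
function dedupeKeyPhrases(array $keyPhrases, float $minScore = 0.0): array
{
    $best = [];
    foreach ($keyPhrases as $phrase) {
        if ($phrase['Score'] < $minScore) {
            continue; // drop low-confidence phrases
        }
        $key = strtolower(trim($phrase['Text'])); // case-insensitive match
        if (!isset($best[$key]) || $phrase['Score'] > $best[$key]['Score']) {
            $best[$key] = $phrase; // keep the highest-scoring occurrence
        }
    }
    // Sort by confidence, highest first (usort also reindexes the array)
    usort($best, function ($a, $b) {
        return $b['Score'] <=> $a['Score'];
    });
    return $best;
}

$sample = [
    ['Text' => 'my chamber door', 'Score' => 0.881901],
    ['Text' => 'my chamber door', 'Score' => 0.898626],
    ['Text' => 'Ethereum', 'Score' => 0.933024],
    ['Text' => 'Ethereum', 'Score' => 0.986565],
];
foreach (dedupeKeyPhrases($sample) as $p) {
    echo $p['Text'] . ' (' . $p['Score'] . ')' . PHP_EOL;
}
```

Keeping only the best score hides how often a phrase occurred, so a count per phrase could be tracked alongside it if frequency matters.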
As mentioned above I’ve rounded the 14 decimal places up for readability. You get back a collection of key phrases that Amazon Comprehend identified from the input text. For each key phrase, the response provides the text of the keyword or phrase, where the keyphrase begins and ends, and the level of confidence that Amazon Comprehend has in the accuracy of the detection. I believe the score is from 0 to 1.0 with higher values indicating a higher confidence but I didn’t see this specifically mentioned in the documentation; I’m basing it on how AWS documents sentiment scores on other functions in the Comprehend service. Below I only list the first 20 keyphrases for each sample along with the confidence score for each.
Sample #1 – LinkedIn Bio | Sample #2 – The Raven | Sample #3 – Editorial | Sample #4 – Dog Tweet |
a well-rounded, cloud focused (0.895357) | a midnight dreary (0.856855) | Last month (0.990694) | Yogi (0.779558) |
any role (0.991994) | many a quaint and curious volume (0.884559) | a plague (0.998621) | his first trip (0.999373) |
contributor (0.986457) | forgotten lore (0.920504) | kittens (0.999314) | the beach (0.932810) |
manager (0.988315) | nearly napping (0.721991) | the most fashionable cryptocurrencies (0.998065) | today (0.994661) |
an organization (0.9967615) | a tapping (0.996199) | the internet (0.995693) | sand (0.915146) |
the goal (0.999682) | some one (0.849618) | news (0.983833) | 12/10 (0.981981) |
a manager/director role (0.9908485) | my chamber door (0.881901) | the cryptocurrency (0.720252) | |
a cloud-focused company (0.997656) | “’Tis (0.746844) | Ethereum (0.933024) | |
My roles and interests (0.888028) | some visitor (0.985374) | bills (0.744276) | |
a wide range (0.999404) | my chamber door (0.898626) | the world computer” (0.847476) | |
responsibilities (0.999585) | Only this (0.714963) | a distributed program (0.993759) | |
expertise (0.996745) | the bleak December (0.984577) | large parts (0.999525) | |
many areas (0.998761) | each separate dying ember (0.912081) | both the legitimate banking system (0.948817) | |
cloud solutions (0.998675) | its ghost (0.998404) | the legal system (0.998754) | |
Amazon Web Services and Microsoft Azure (0.934452) | the floor (0.996622) | contracts (0.934160) | |
security (0.934451) | the morrow (0.996977) | computer code (0.997170) | |
domains (0.898301) | my books surcease (0.957688) | Ethereum (0.986565) | |
DNS (0.667329) | sorrowsorrow (0.996818) | the plaything (0.997950) | |
web technologies (0.992197) | skip all the way down to spot 69… | skip down to spot 115, the last items returned… | |
servers and development (0.856500) | a stately Raven (0.879681) | the cryptokittens (0.977406) | |
Azure Text Analytics
Resources: |
Text Analytics Product Page |
Text Analytics Documentation |
Text Analytics API Docs |
Text Analytics Pricing |
Features
Azure’s feature set in Text Analytics is considerably lighter than the competing services. It does what I was looking for so I can’t complain about that but if you’re looking for more functionality you may need to look at IBM or AWS.
Sentiment analysis
The API returns a numeric score between 0 and 1. Scores close to 1 indicate positive sentiment, and scores close to 0 indicate negative sentiment. Sentiment score is generated using classification techniques. The input features of the classifier include n-grams, features generated from part-of-speech tags, and word embeddings. It is supported in a variety of languages.
Key phrase extraction
The API returns a list of strings denoting the key talking points in the input text. Microsoft says it employs techniques from Microsoft Office’s sophisticated Natural Language Processing toolkit. English, German, Spanish, and Japanese text are supported.
Language detection
The API returns the detected language and a numeric score between 0 and 1. Scores close to 1 indicate high certainty that the identified language is correct. A total of 120 languages are supported.
Pricing
The free tier gives you up to 5,000 transactions. Once you go beyond that mark be prepared to pay $75 for the next tier. There is no per transaction incremental charge so you’ll see big jumps in fees for potentially small changes in your usage. Each “document” analyzed for each API call will be considered a transaction. Honestly, the subscription model they offer would likely keep me from using Azure’s Text Analytics for any small projects that I didn’t have an immediate method to monetize.
Limits
I do like that Azure will let you push a lot of text through at one time to help reduce the likelihood of being rate limited. Their “document” size is in line with Amazon’s limits as well.
- Maximum size of a single document: 5,000 characters, as measured by String.Length.
- Maximum size of an entire request: 1 MB.
- Maximum number of documents in a request: 1,000.
- The rate limit is 100 calls per minute. Note that you can submit a large quantity of documents in a single call (up to 1,000 documents).
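To stay under the rate limit, many documents can be grouped into a few large requests. The sketch below splits a list of texts into request bodies of at most 1,000 documents each, using the same `documents` structure as the request in the API section. The helper name `buildAzureBatches` is hypothetical, it assumes document IDs only need to be unique within a single request, and it does not check the 1 MB request size or the 5,000-character document limit.

```php
<?php
// Hypothetical helper: groups texts into request bodies of at most $maxDocs documents.
// The 1 MB request size and 5,000-character document limits are NOT checked here.
function buildAzureBatches(array $texts, string $language = 'en', int $maxDocs = 1000): array
{
    $batches = [];
    foreach (array_chunk($texts, $maxDocs) as $chunk) {
        $documents = [];
        foreach ($chunk as $i => $text) {
            $documents[] = [
                'language' => $language,
                'id' => (string) ($i + 1), // assumed: unique within one request is enough
                'text' => $text,
            ];
        }
        $batches[] = ['documents' => $documents];
    }
    return $batches;
}

$batches = buildAzureBatches(array_fill(0, 2500, 'some text'));
echo count($batches) . PHP_EOL; // 3
```

Each element of `$batches` can then be passed through `json_encode()` and sent as the body of its own POST.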
API
With a little Googling, I found a PHP SDK for Azure but it seems to only support the base IaaS services right now. I was going for consistency by using a single programming language in this post, so I had to use Azure’s REST API. Authentication is handled by passing your API key in a header of your HTTP POST.
```php
<?php
// Provide your Azure API key to access the Text Analytics service
$azure_key_1 = '';
$azure_endpoint = 'https://eastus.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases';

// The text you want to analyze
$text = '';

// Do the work and get your result
$data = array(
    'documents' => array(
        array(
            'language' => 'en',
            'id' => '1',
            'text' => "$text"
        )
    )
);
$data = json_encode($data);

$azure = curl_init();
curl_setopt($azure, CURLOPT_URL, $azure_endpoint);
curl_setopt($azure, CURLOPT_POST, 1);
curl_setopt($azure, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($azure, CURLOPT_HTTPHEADER, array(
    "Ocp-Apim-Subscription-Key: $azure_key_1",
    'Content-Type: application/json',
    'Accept: application/json'
));
curl_setopt($azure, CURLOPT_POSTFIELDS, $data);
$response = curl_exec($azure);
$azure_result = json_decode($response, true);
curl_close($azure);

// $azure_result will be an array you can access for the returned data
echo 'Azure Results:' . PHP_EOL;
foreach ( $azure_result['documents'][0]['keyPhrases'] as $i ) {
    echo $i . PHP_EOL;
}

// Alternately, if you'd like to see the full set of data returned
print_r($azure_result);
?>
```
Results
For the sentiment tests in my previous post, I wasn’t crazy about the lack of detail from Azure. However, in this instance, I’m relatively happy with their keyword results. I received the list of keyphrases I was looking for and, in my opinion, their parsing is fairly tight and pretty accurate as far as what I would pick out. However, I felt some interesting keywords were missed or too broadly lumped in as a phrase (I’m thinking about ravens and cryptokittens here). They stripped out irrelevant punctuation from words and seemingly avoided duplication, which I liked. Don’t get me wrong, some results are still a little odd, but at least Azure doesn’t break Amazon away from Web Services or lump it in as part of a larger phrase. I’m only listing the first 20 keywords returned for each sample.
Sample #1 – LinkedIn Bio | Sample #2 – The Raven | Sample #3 – Editorial | Sample #4 – Dog Tweet |
cloud solutions | chamber door | code of Ethereum | trip |
cloud-focused company | tapping | software | beach |
professional services | ominous bird of yore | week | sand |
executive management | ebony bird | ignorance of computer code | Yogi |
cloud platform APIs | lost Lenore | world computer | |
Program Management | memories of Lenore | wake of bitcoin | |
fraud solution | stately Raven | fashionable cryptocurrencies | |
manager | whispered word | thousands of cryptocurrencies | |
design | sad fancy | legitimate banking system | |
global load balancing solution | Tis | legal system | |
Amazon Web Services | soul | modern computer chips | |
technical project management | ancient Raven wandering | real currency | |
web technologies | sculptured bust | trust | |
Google Analytics | placid bust | times | |
dollar solutions | bust of Pallas | propagation of imaginary kittens | |
team members | angels | legitimate companies | |
Google Adwords | entrance | world’s credit card system | |
director role | velvet-violet lining | medium of exchange | |
numerous initiatives | melancholy burden bore | excellent medium | |
numerous vendors | cushion’s velvet lining | plague of kittens | |
Google Natural Language
Resources: |
Natural Language Product Page |
Natural Language Documentation |
Natural Language REST API Docs |
Natural Language Pricing |
Features
Google offers some interesting tools. Syntax Analysis and Entity Recognition look pretty cool, although I’m not sure what practical use I would personally have for them. I had to read through the docs a bit to catch that they lump keyword detection in with Entity Recognition.
Syntax Analysis
Extract tokens and sentences, identify parts of speech (PoS) and create dependency parse trees for each sentence.
Entity Recognition
Identify entities and label by types such as person, organization, location, events, products, and media.
Sentiment Analysis
Understand the overall sentiment expressed in a block of text.
Content Classification
Classify documents into 700+ predefined categories.
Multi-Language
Enables you to easily analyze text in multiple languages including English, Spanish, Japanese, Chinese (Simplified and Traditional), French, German, Italian, Korean and Portuguese.
Pricing
The Natural Language API is priced using units of measurement known as text records. A text record may contain up to 1,000 Unicode characters within the text content sent to the API for evaluation. Text in excess of these 1,000 characters counts as additional records. Prices are expressed in dollars per 1,000 text records (1,000,000 Unicode characters).
- The first 5K text records are free.
- Beyond 5K text records, the cost depends on the features used.
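To see how text length maps to billable text records, here is a minimal sketch. The helper name `countTextRecords` is hypothetical, and it assumes that “Unicode characters” means code points and that the input is valid UTF-8; an empty string comes out as zero records, which may not match how Google actually bills.

```php
<?php
// Hypothetical helper: counts "text records" (1 record = up to 1,000 Unicode characters).
// Assumes valid UTF-8 input and that characters are counted as code points.
function countTextRecords(string $utf8Text): int
{
    // Every byte that is not a UTF-8 continuation byte starts a new code point.
    $characters = preg_match_all('/[^\x80-\xBF]/', $utf8Text);
    return (int) ceil($characters / 1000);
}

echo countTextRecords(str_repeat('a', 1000)) . PHP_EOL; // 1
echo countTextRecords(str_repeat('a', 1001)) . PHP_EOL; // 2
```

Under these assumptions, each of the four samples in this post would count as only a handful of records.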
Limits
Google’s limits are based on text size, words/tokens in the text, and entity mentions. The API responds in different ways based on which limit you’ve exceeded. It’s not easy to describe, so I’d recommend you read up on Google’s Natural Language Quotas.
API
Google Cloud Platform does offer a PHP SDK but frankly, there were too many packages required to be installed to support it on my VM and I didn’t want to hassle with it. I decided to simply use their REST API instead. Your requests are authenticated by including your API key in the POST URI.
```php
<?php
// Provide your GCP API key to access the Natural Language service
$gcp_api_key = '';
$gcp_endpoint = 'https://language.googleapis.com/v1/documents:analyzeEntitySentiment?key=';

// The text you want to analyze
$text = '';

// Do the work and get your result
$data = array(
    'document' => array(
        'type' => 'PLAIN_TEXT',
        'language' => 'en',
        'content' => "$text"
    ),
    'encodingType' => 'UTF8'
);
$data = json_encode($data);

$gcp = curl_init();
curl_setopt($gcp, CURLOPT_URL, "$gcp_endpoint$gcp_api_key");
curl_setopt($gcp, CURLOPT_POST, 1);
curl_setopt($gcp, CURLOPT_RETURNTRANSFER, true);
curl_setopt($gcp, CURLOPT_HTTPHEADER, array(
    'Content-Type: application/json',
    'Accept: application/json'
));
curl_setopt($gcp, CURLOPT_POSTFIELDS, $data);
$response = curl_exec($gcp);
$gcp_result = json_decode($response, true);
curl_close($gcp);

// $gcp_result will be an array you can access for the returned data
echo 'Google Cloud Platform' . PHP_EOL;
echo 'Keyphrases: ' . PHP_EOL;
foreach ( $gcp_result['entities'] as $i ) {
    echo $i['name'] . PHP_EOL;
    echo 'Salience: ' . $i['salience'] . PHP_EOL;
    echo 'Sentiment: ' . PHP_EOL;
    echo ' - Score: ' . $i['sentiment']['score'] . PHP_EOL;
    echo ' - Magnitude: ' . $i['sentiment']['magnitude'] . PHP_EOL;
    echo PHP_EOL;
}

// Alternately, if you'd like to see the full set of data returned
print_r($gcp_result);
?>
```
Results
Google’s keyword or phrase matching seems very tight, almost too tight. Note that “Amazon” is stripped off from “Web Services”. Google breaks your document down into sentence entities and finds keywords in each entity, so you will see duplicated keywords with differing salience values. The salience scores were lower than I would have expected. Looking specifically at Sample #3, Ethereum had a score of 0.0057588066 and cryptokittens a score of 0.001952143, both of which I would have expected to be higher.
The Salience value looks to be a value between 0 and 1.0. The higher the relevance of the keyword to the text the higher the salience score will be.
Google’s Score is a value between -1.0 (Negative) and 1.0 (positive) and corresponds to the overall emotional leaning of the text. Strong positive and negative statements could balance each other out in the score. Generally, negative is -1.0 to -0.25, neutral is -0.25 to 0.25, and positive is 0.25 to 1.0.
Magnitude is from 0.0 to +inf. The higher the magnitude the stronger the positive or negative sentiment, based on the score. Lower magnitude indicates statements balancing others out, while higher values indicate the weight of the score overall. Because the Magnitude can extend to infinity it seems like there is far too much room to wonder about its value compared to scores for other blocks of text. Very subjective.
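The rough score bands above can be turned into a small labeling helper. This is only a sketch: the function name `labelSentimentScore` is hypothetical, and because the bands share their boundary values, it assumes that exactly -0.25 and 0.25 count as neutral.

```php
<?php
// Hypothetical helper: maps a Google sentiment score to the rough bands described above.
// Assumption: the boundary values -0.25 and 0.25 are treated as neutral.
function labelSentimentScore(float $score): string
{
    if ($score < -0.25) {
        return 'negative';
    }
    if ($score > 0.25) {
        return 'positive';
    }
    return 'neutral';
}

// Scores taken from the Sample #3 column of the results table below
echo labelSentimentScore(-0.9) . PHP_EOL; // kittens: negative
echo labelSentimentScore(0.9) . PHP_EOL;  // cryptocurrencies: positive
echo labelSentimentScore(0.0) . PHP_EOL;  // internet: neutral
```

Magnitude is deliberately left out of the label, since it has no upper bound and is hard to compare between texts of different lengths.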
Sample #1 – LinkedIn Bio | Sample #2 – The Raven | Sample #3 – Editorial | Sample #4 – Dog Tweet |
IT professional Salience: 0.2407715 Sentiment: 0.1 [magnitude] 0 [score] | chamber door Salience: 0.33098266 Sentiment: 0.3 [magnitude] 0 [score] | cryptocurrencies Salience: 0.036509313 Sentiment: 0.9 [magnitude] 0.9 [score] | Meet Yogi Salience: 0.8369134 Sentiment: 0 [magnitude] 0 [score] |
cloud Salience: 0.098823205 Sentiment: 0.2 [magnitude] 0.2 [score] | Lenore Nameless Salience: 0.067261524 Sentiment: 2 [magnitude] -0.2 [score] | one Salience: 0.03649575 Sentiment: 0.5 [magnitude] 0.5 [score] | trip Salience: 0.07094695 Sentiment: 0 [magnitude] 0 [score] |
manager Salience: 0.0476862 Sentiment: 0 [magnitude] 0 [score] | ember Salience: 0.024451666 Sentiment: 1.2 [magnitude] -0.6 [score] | program Salience: 0.033484023 Sentiment: 0.1 [magnitude] 0 [score] | beach Salience: 0.048720725 Sentiment: 0 [magnitude] 0 [score] |
role Salience: 0.041792165 Sentiment: 0 [magnitude] 0 [score] | tapping Salience: 0.022468273 Sentiment: 0 [magnitude] 0 [score] | kittens Salience: 0.030432394 Sentiment: 0.9 [magnitude] -0.9 [score] | sand Salience: 0.043418925 Sentiment: 0.8 [magnitude] -0.8 [score] |
responsibilities Salience: 0.036698595 Sentiment: 0.4 [magnitude] 0.2 [score] | lore Salience: 0.02031062 Sentiment: 0 [magnitude] 0 [score] | internet Salience: 0.024274383 Sentiment: 0 [magnitude] 0 [score] | |
contributor Salience: 0.033336643 Sentiment: 0 [magnitude] 0 [score] | volume Salience: 0.02031062 Sentiment: 0.9 [magnitude] -0.9 [score] | contracts Salience: 0.018840473 Sentiment: 0 [magnitude] 0 [score] | |
organization Salience: 0.03316731 Sentiment: 0.1 [magnitude] 0.1 [score] | heart Salience: 0.016615199 Sentiment: 0.8 [magnitude] -0.4 [score] | cryptocurrency Salience: 0.017475905 Sentiment: 0 [magnitude] 0 [score] | |
solution Salience: 0.026711168 Sentiment: 0.7 [magnitude] -0.3 [score] | more Salience: 0.015859045 Sentiment: 0.1 [magnitude] -0.1 [score] | world computer Salience: 0.017475905 Sentiment: 0 [magnitude] 0 [score] | |
load balancing solution Salience: 0.019928368 Sentiment: 0.2 [magnitude] 0.1 [score] | nothing Salience: 0.015256954 Sentiment: 0.2 [magnitude] -0.2 [score] | cryptocurrencies Salience: 0.01685153 Sentiment: 0.2 [magnitude] -0.1 [score] | |
Program Management Salience: 0.016665569 Sentiment: 0.3 [magnitude] 0.3 [score] | word Salience: 0.015148128 Sentiment: 0.8 [magnitude] -0.4 [score] | plague Salience: 0.016228218 Sentiment: 0.7 [magnitude] -0.7 [score] | |
manager/director role Salience: 0.015977792 Sentiment: 0 [magnitude] 0 [score] | human being Salience: 0.014992289 Sentiment: 0.5 [magnitude] 0 [score] | game Salience: 0.014894219 Sentiment: 0 [magnitude] 0 [score] | |
company Salience: 0.014537535 Sentiment: 0.9 [magnitude] 0.9 [score] | visitor Salience: 0.012147822 Sentiment: 0.4 [magnitude] -0.4 [score] | banking system Salience: 0.014547589 Sentiment: 0 [magnitude] 0 [score] | |
expertise Salience: 0.0144847995 Sentiment: 0.1 [magnitude] 0.1 [score] | master Salience: 0.011745481 Sentiment: 1.2 [magnitude] -0.2 [score] | system Salience: 0.014547589 Sentiment: 0.1 [magnitude] -0.1 [score] | |
roles Salience: 0.013508974 Sentiment: 0.2 [magnitude] 0.2 [score] | Raven “Nevermore Salience: 0.011127895 Sentiment: 2.4 [magnitude] -0.2 [score] | computer code Salience: 0.014547589 Sentiment: 0 [magnitude] 0 [score] | |
goal Salience: 0.012724441 Sentiment: 0.9 [magnitude] 0.9 [score] | floor Salience: 0.010994431 Sentiment: 0.1 [magnitude] -0.1 [score] | parts Salience: 0.011588174 Sentiment: 0.8 [magnitude] 0.8 [score] | |
interests Salience: 0.012678256 Sentiment: 0.2 [magnitude] 0.2 [score] | morrow Salience: 0.010954482 Sentiment: 0 [magnitude] 0 [score] | neither Salience: 0.011522884 Sentiment: 0.4 [magnitude] 0.2 [score] | |
Web Services Salience: 0.011578682 Sentiment: 0 [magnitude] 0 [score] | ghost Salience: 0.009621214 Sentiment: 0.2 [magnitude] -0.2 [score] | news Salience: 0.0093484325 Sentiment: 0 [magnitude] 0 [score] | |
servers Salience: 0.011578682 Sentiment: 0.1 [magnitude] 0.1 [score] | maiden Salience: 0.009586242 Sentiment: 0.5 [magnitude] 0.5 [score] | dream Salience: 0.00748501 Sentiment: 0.5 [magnitude] 0.2 [score] | |
experience Salience: 0.010530363 Sentiment: 0 [magnitude] 0 [score] | books surcease Salience: 0.009586242 Sentiment: 0 [magnitude] 0 [score] | programmers Salience: 0.0069575156 Sentiment: 0 [magnitude] 0 [score] | |
customers Salience: 0.010236813 Sentiment: 0 [magnitude] 0 [score] | sorrowsorrow Salience: 0.009586242 Sentiment: 0 [magnitude] 0 [score] | need Salience: 0.0068155993 Sentiment: 0.3 [magnitude] -0.1 [score] | |
IBM Cloud (Watson) Natural Language Understanding
Features
IBM’s feature set is the most mature among the services I tested. This shouldn’t be a surprise considering their success with Watson over the years and the training they’ve been able to apply to their artificial intelligence platform. This article only covers the Keywords function, along with the sentiment and emotion values it can return for each keyword.
Categories
Categorize your content using a five-level classification hierarchy. View the complete list of categories here.
Concepts
Identify high-level concepts that aren’t necessarily directly referenced in the text.
Emotion
Analyze emotion conveyed by specific target phrases or by the document as a whole. You can also enable emotion analysis for entities and keywords that are automatically detected by the service.
Entities
Find people, places, events, and other types of entities mentioned in your content. View the complete list of entity types and subtypes here.
Keywords
Search your content for relevant keywords.
Metadata
For HTML and URL input, get the author of the webpage, the page title, and the publication date.
Relations
Recognize when two entities are related, and identify the type of relation.
Semantic Roles
Parse sentences into subject-action-object form, and identify entities and keywords that are subjects or objects of an action.
Sentiment
Analyze the sentiment toward specific target phrases and the sentiment of the document as a whole. You can also get sentiment information for detected entities and keywords by enabling the sentiment option for those features.
As you’ll read below, this maturity and ease of use applies to their API as well.
Pricing
Pricing depends on your overall IBM Cloud subscription plan. On the Lite (read: free) plan you get up to 30,000 NLU items per month. If you upgrade to the paid Standard plan, billing is purely usage-based, charged per NLU item with discounts based on volume. I found it interesting that when you sign up for the Lite plan they don’t even ask for a credit card number.
An NLU item is based on the number of data units enriched and the number of enrichment features applied. A data unit is 10,000 characters or less. For example: extracting Entities and Sentiment from 15,000 characters of text is (2 Data Units * 2 Enrichment Features) = 4 NLU Items.
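The NLU item formula above is easy to sketch in code. The helper name `countNluItems` is hypothetical, and the snippet simply reproduces the worked example from the text.

```php
<?php
// Hypothetical helper: NLU items = data units x enrichment features,
// where one data unit covers up to 10,000 characters.
function countNluItems(int $characters, int $features): int
{
    $dataUnits = (int) ceil($characters / 10000);
    return $dataUnits * $features;
}

// Entities and Sentiment (2 features) on 15,000 characters of text
echo countNluItems(15000, 2) . PHP_EOL; // 4
```

Since each requested feature multiplies the count, asking only for the features you need keeps you inside the Lite plan longer.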
Limits
This was incredibly difficult to track down on the IBM website. Based on the 30 July 2017 Release Notes, text greater than 50K characters will be truncated; the previous limit was 1 kilobyte (1,024 bytes). Fifty thousand characters is considerably higher than the limits of the other services discussed in this post.
API
The IBM Cloud website only details their REST API, so that’s what I used to code against. However, an after-the-fact search revealed there is a PHP SDK (https://github.com/CognitiveBuild/WatsonPHPSDK or https://github.com/ThomasIBM/php-sdk).
A nice plus for IBM’s API is that you can request multiple features in each API call. They still charge you per feature requested but it’s a lot more convenient to be able to ask once for everything you want rather than making individual calls for each function.
```php
<?php
// IBM will give you a specific username and password to access the NLU service
$ibm_user = '';
$ibm_pass = '';
$ibm_endpoint = 'https://gateway.watsonplatform.net/natural-language-understanding/api/v1/analyze?version=2017-02-27';

// The text you want to analyze
$text = '';

// Do the work and get your result
$data = array(
    'text' => "$text",
    'features' => array(
        'keywords' => array(
            'sentiment' => true,
            'emotion' => true,
        )
    ),
);
$data = json_encode($data);

$ibm = curl_init();
curl_setopt($ibm, CURLOPT_URL, $ibm_endpoint);
curl_setopt($ibm, CURLOPT_POST, 1);
curl_setopt($ibm, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ibm, CURLOPT_HTTPHEADER, array(
    'Content-Type: application/json',
    'Accept: application/json'
));
curl_setopt($ibm, CURLOPT_USERPWD, "$ibm_user:$ibm_pass");
curl_setopt($ibm, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
curl_setopt($ibm, CURLOPT_POSTFIELDS, $data);
$response = curl_exec($ibm);
$ibm_result = json_decode($response, true);
curl_close($ibm);

// $ibm_result will be an array you can access for the returned data
echo 'IBM Cloud / Watson' . PHP_EOL;
echo 'Keyphrases: ' . PHP_EOL;
foreach ( $ibm_result['keywords'] as $i ) {
    echo $i['text'] . PHP_EOL;
    echo 'Relevance: ' . $i['relevance'] . PHP_EOL;
    echo 'Sentiment: ' . $i['sentiment']['score'] . PHP_EOL;
    echo 'Emotions: ' . PHP_EOL;
    echo ' -Sadness: ' . $i['emotion']['sadness'] . PHP_EOL;
    echo ' -Joy: ' . $i['emotion']['joy'] . PHP_EOL;
    echo ' -Fear: ' . $i['emotion']['fear'] . PHP_EOL;
    echo ' -Disgust: ' . $i['emotion']['disgust'] . PHP_EOL;
    echo ' -Anger: ' . $i['emotion']['anger'] . PHP_EOL;
    echo PHP_EOL;
}

// Alternately, if you'd like to see the full set of data returned
print_r($ibm_result);
?>
```
Results
Based on the maturity of Watson, perhaps I shouldn’t be surprised that I found its results the most relevant and accurate.
Watson returned tight and accurate keywords and phrases and caught most of what I felt were important. I was very happy with the results. I think Watson was the only engine to specifically pick out “Raven” for Sample #2. The relevance scores were pretty good but I found the sentiment and emotion values to be the most interesting in this context of keywords rather than the entire block of text. Perhaps it works better with dramatic text found in short stories and poems but to see the word “raven” get a relatively high level of certainty for Sadness, Joy, Fear, and Anger (leaving out only Disgust) seemed very appropriate.
Sentiment scores range from -1 (negative sentiment) to 1 (positive sentiment). Emotion scores range from 0 to 1 for sadness, joy, fear, disgust, and anger. A 0 means the text doesn’t convey the emotion, and a 1 means the text definitely carries the emotion.
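To make those ranges a little more concrete, here is a minimal sketch of how a single keyword entry from $ibm_result['keywords'] might be summarized. The helper names sentiment_label() and dominant_emotion() and the 0.25 neutral band are my own assumptions for illustration, not part of the Watson API. The sample values are the “lost Lenore” entry from the Sample #2 results.

```php
<?php
// Hypothetical helpers for illustration; not part of the Watson NLU API.

// Map a sentiment score (-1 to 1) to a rough label.
// The 0.25 neutral band is an arbitrary assumption; adjust as needed.
function sentiment_label(float $score, float $band = 0.25): string
{
    if ($score > $band) {
        return 'positive';
    }
    if ($score < -$band) {
        return 'negative';
    }
    return 'neutral';
}

// Return the name of the highest-scoring emotion (each scored 0 to 1).
function dominant_emotion(array $emotions): string
{
    arsort($emotions); // sort descending by score, preserving keys
    return (string) array_key_first($emotions);
}

// One keyword entry shaped like the items in $ibm_result['keywords'],
// using the "lost Lenore" values from the Sample #2 results.
$keyword = array(
    'text'      => 'lost Lenore',
    'relevance' => 0.501634,
    'sentiment' => array('score' => -0.568457),
    'emotion'   => array(
        'sadness' => 0.764473,
        'joy'     => 0.008086,
        'fear'    => 0.247296,
        'disgust' => 0.087756,
        'anger'   => 0.197887,
    ),
);

// Prints "lost Lenore: negative, mostly sadness"
echo $keyword['text'] . ': '
    . sentiment_label($keyword['sentiment']['score']) . ', mostly '
    . dominant_emotion($keyword['emotion']) . PHP_EOL;
```

Note that array_key_first() requires PHP 7.3 or later. A single "dominant" emotion also hides cases like “Raven”, where four emotions score almost equally, so treat this as a quick summary rather than a replacement for the full scores.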
Sample #1 – LinkedIn Bio | Sample #2 – The Raven | Sample #3 – Editorial | Sample #4 – Dog Tweet |
cloud platform APIs Relevance: 0.995924 Sentiment: 0 Emotions: -Sadness: 0.055568 -Joy: 0.211603 -Fear: 0.046425 -Disgust: 0.013121 -Anger: 0.02802 | chamber door Relevance: 0.901334 Sentiment: -0.0842639 Emotions: -Sadness: 0.122429 -Joy: 0.629781 -Fear: 0.202819 -Disgust: 0.152286 -Anger: 0.245196 | real currency Relevance: 0.900134 Sentiment: -0.486542 Emotions: -Sadness: 0.06918 -Joy: 0.263669 -Fear: 0.086167 -Disgust: 0.036775 -Anger: 0.123255 | sand Relevance: 0.988505 Sentiment: -0.672347 Emotions: -Sadness: 0.601929 -Joy: 0.047119 -Fear: 0.177971 -Disgust: 0.35354 -Anger: 0.046075 |
Amazon Web Services Relevance: 0.946893 Sentiment: 0.32613 Emotions: -Sadness: 0.088253 -Joy: 0.082979 -Fear: 0.008404 -Disgust: 0.020818 -Anger: 0.0714 | lost Lenore Relevance: 0.501634 Sentiment: -0.568457 Emotions: -Sadness: 0.764473 -Joy: 0.008086 -Fear: 0.247296 -Disgust: 0.087756 -Anger: 0.197887 | Beanie Baby craze Relevance: 0.891312 Sentiment: -0.558869 Emotions: -Sadness: 0.724752 -Joy: 0.124824 -Fear: 0.095344 -Disgust: 0.04535 -Anger: 0.09825 | trip Relevance: 0.845717 Sentiment: 0 Emotions: -Sadness: 0.082996 -Joy: 0.768652 -Fear: 0.055204 -Disgust: 0.02813 -Anger: 0.057443 |
technical project management Relevance: 0.918105 Sentiment: 0.922231 Emotions: -Sadness: 0.239939 -Joy: 0.364821 -Fear: 0.112302 -Disgust: 0.038732 -Anger: 0.208574 | separate dying ember Relevance: 0.482499 Sentiment: -0.494962 Emotions: -Sadness: 0.727678 -Joy: 0.012482 -Fear: 0.320462 -Disgust: 0.055322 -Anger: 0.080813 | small Canadian company Relevance: 0.875852 Sentiment: 0.53669 Emotions: -Sadness: 0.081233 -Joy: 0.310396 -Fear: 0.007662 -Disgust: 0.533 -Anger: 0.103954 | beach Relevance: 0.836752 Sentiment: 0 Emotions: -Sadness: 0.082996 -Joy: 0.768652 -Fear: 0.055204 -Disgust: 0.02813 -Anger: 0.057443 |
page load times Relevance: 0.914624 Sentiment: 0.835434 Emotions: -Sadness: 0.108787 -Joy: 0.448846 -Fear: 0.035503 -Disgust: 0.004488 -Anger: 0.025469 | meaninglittle relevancy bore Relevance: 0.459712 Sentiment: -0.825676 Emotions: -Sadness: 0.337678 -Joy: 0.154829 -Fear: 0.321075 -Disgust: 0.050412 -Anger: 0.311102 | suitably malicious webpage Relevance: 0.865377 Sentiment: -0.781828 Emotions: -Sadness: 0.494608 -Joy: 0.107754 -Fear: 0.086254 -Disgust: 0.245535 -Anger: 0.212926 | |
agents increasing service Relevance: 0.9146 Sentiment: 0 Emotions: -Sadness: 0.2539 -Joy: 0.097912 -Fear: 0.065398 -Disgust: 0.13268 -Anger: 0.096632 | lamp-light gloating o’er Relevance: 0.445707 Sentiment: 0 Emotions: -Sadness: 0.188522 -Joy: 0.186324 -Fear: 0.182589 -Disgust: 0.200685 -Anger: 0.047279 | Ethereum Relevance: 0.747639 Sentiment: -0.513226 Emotions: -Sadness: 0.16149 -Joy: 0.191175 -Fear: 0.080751 -Disgust: 0.055043 -Anger: 0.517517 | |
complex multi-million-dollar solutions Relevance: 0.912594 Sentiment: 0.922231 Emotions: -Sadness: 0.239939 -Joy: 0.364821 -Fear: 0.112302 -Disgust: 0.038732 -Anger: 0.208574 | ominous bird Relevance: 0.442842 Sentiment: 0 Emotions: -Sadness: 0.21515 -Joy: 0.089418 -Fear: 0.196796 -Disgust: 0.15498 -Anger: 0.209489 | legitimate banking Relevance: 0.744846 Sentiment: 0 Emotions: -Sadness: 0.205808 -Joy: 0.159109 -Fear: 0.049965 -Disgust: 0.013543 -Anger: 0.14675 | |
HPE Helion Relevance: 0.798603 Sentiment: 0 Emotions: -Sadness: 0.154141 -Joy: 0.172429 -Fear: 0.095666 -Disgust: 0.076074 -Anger: 0.094764 | Raven Relevance: 0.440981 Sentiment: 0.424863 Emotions: -Sadness: 0.694532 -Joy: 0.690868 -Fear: 0.687529 -Disgust: 0.0456 -Anger: 0.680811 | fashionable cryptocurrencies Relevance: 0.73987 Sentiment: -0.420752 Emotions: -Sadness: 0.204408 -Joy: 0.413299 -Fear: 0.058367 -Disgust: 0.092789 -Anger: 0.086215 | |
cloud solutions Relevance: 0.791795 Sentiment: 0.32613 Emotions: -Sadness: 0.088253 -Joy: 0.082979 -Fear: 0.008404 -Disgust: 0.020818 -Anger: 0.0714 | Nevermore Relevance: 0.42179 Sentiment: 0.436053 Emotions: -Sadness: 0 -Joy: 0 -Fear: 0 -Disgust: 0 -Anger: 0 | imaginary kitten Relevance: 0.728115 Sentiment: -0.387433 Emotions: -Sadness: 0.201491 -Joy: 0.487955 -Fear: 0.101491 -Disgust: 0.053981 -Anger: 0.071164 | |
Google Adwords Relevance: 0.764266 Sentiment: 0 Emotions: -Sadness: 0.191606 -Joy: 0.134409 -Fear: 0.037958 -Disgust: 0.051213 -Anger: 0.064268 | gently rapping Relevance: 0.418179 Sentiment: 0 Emotions: -Sadness: 0.261638 -Joy: 0.01624 -Fear: 0.253446 -Disgust: 0.49803 -Anger: 0.30956 | imaginary kittens Relevance: 0.725727 Sentiment: 0.53669 Emotions: -Sadness: 0.081233 -Joy: 0.310396 -Fear: 0.007662 -Disgust: 0.533 -Anger: 0.103954 | |
Microsoft Azure Relevance: 0.75976 Sentiment: 0.32613 Emotions: -Sadness: 0.088253 -Joy: 0.082979 -Fear: 0.008404 -Disgust: 0.020818 -Anger: 0.0714 | midnight dreary Relevance: 0.4073 Sentiment: -0.722491 Emotions: -Sadness: 0.079813 -Joy: 0.253367 -Fear: 0.061463 -Disgust: 0.114973 -Anger: 0.038623 | large parts Relevance: 0.725197 Sentiment: 0 Emotions: -Sadness: 0.205808 -Joy: 0.159109 -Fear: 0.049965 -Disgust: 0.013543 -Anger: 0.14675 | |
anti-fraud solution Relevance: 0.756846 Sentiment: 0 Emotions: -Sadness: 0.279138 -Joy: 0.240411 -Fear: 0.111477 -Disgust: 0.118215 -Anger: 0.331585 | stately Raven Relevance: 0.404736 Sentiment: -0.368547 Emotions: -Sadness: 0.203589 -Joy: 0.334872 -Fear: 0.14134 -Disgust: 0.04234 -Anger: 0.071284 | gold vanish Relevance: 0.711218 Sentiment: -0.687961 Emotions: -Sadness: 0.506992 -Joy: 0.158692 -Fear: 0.133844 -Disgust: 0.156028 -Anger: 0.210273 | |
Google Analytics Relevance: 0.749963 Sentiment: 0 Emotions: -Sadness: 0.163427 -Joy: 0.163105 -Fear: 0.063402 -Disgust: 0.063864 -Anger: 0.079907 | curious volume Relevance: 0.404517 Sentiment: -0.365095 Emotions: -Sadness: 0.27674 -Joy: 0.138882 -Fear: 0.145411 -Disgust: 0.033914 -Anger: 0.126784 | tradeable value Relevance: 0.710171 Sentiment: -0.585778 Emotions: -Sadness: 0.312596 -Joy: 0.257172 -Fear: 0.098399 -Disgust: 0.010844 -Anger: 0.308837 | |
cloud-focused company Relevance: 0.749223 Sentiment: 0.713568 Emotions: -Sadness: 0.039198 -Joy: 0.585749 -Fear: 0.004257 -Disgust: 0.014587 -Anger: 0.005712 | fantastic terrors Relevance: 0.398956 Sentiment: 0 Emotions: -Sadness: 0.06198 -Joy: 0.8648 -Fear: 0.004159 -Disgust: 0.006427 -Anger: 0.004621 | Spectre flaws Relevance: 0.708683 Sentiment: 0 Emotions: -Sadness: 0.337198 -Joy: 0.366092 -Fear: 0.112892 -Disgust: 0.038348 -Anger: 0.095604 | |
manager/director role Relevance: 0.746573 Sentiment: 0.713568 Emotions: -Sadness: 0.039198 -Joy: 0.585749 -Fear: 0.004257 -Disgust: 0.014587 -Anger: 0.005712 | thy crest Relevance: 0.39861 Sentiment: 0 Emotions: -Sadness: 0.303813 -Joy: 0.161173 -Fear: 0.234302 -Disgust: 0.051151 -Anger: 0.083632 | latest manifestation Relevance: 0.708348 Sentiment: 0 Emotions: -Sadness: 0.198911 -Joy: 0.348413 -Fear: 0.209789 -Disgust: 0.013768 -Anger: 0.072017 | |
wide range Relevance: 0.745984 Sentiment: 0.32613 Emotions: -Sadness: 0.088253 -Joy: 0.082979 -Fear: 0.008404 -Disgust: 0.020818 -Anger: 0.0714 | uncertain rustling Relevance: 0.397321 Sentiment: -0.568999 Emotions: -Sadness: 0.117186 -Joy: 0.282085 -Fear: 0.03773 -Disgust: 0.065539 -Anger: 0.079408 | network briefly Relevance: 0.70786 Sentiment: -0.322578 Emotions: -Sadness: 0.185012 -Joy: 0.474092 -Fear: 0.127986 -Disgust: 0.114617 -Anger: 0.171347 | |
legal compliance Relevance: 0.741952 Sentiment: 0 Emotions: -Sadness: 0.122162 -Joy: 0.18645 -Fear: 0.12973 -Disgust: 0.074024 -Anger: 0.140856 | placid bust Relevance: 0.396545 Sentiment: -0.689534 Emotions: -Sadness: 0.893423 -Joy: 0.003638 -Fear: 0.140634 -Disgust: 0.085322 -Anger: 0.092297 | intense speculation Relevance: 0.705586 Sentiment: 0.551383 Emotions: -Sadness: 0.230166 -Joy: 0.424813 -Fear: 0.062365 -Disgust: 0.023201 -Anger: 0.074579 | |
web technologies Relevance: 0.739898 Sentiment: 0 Emotions: -Sadness: 0.299532 -Joy: 0.296768 -Fear: 0.054391 -Disgust: 0.034434 -Anger: 0.074903 | sculptured bust Relevance: 0.394975 Sentiment: -0.569724 Emotions: -Sadness: 0.132652 -Joy: 0.083568 -Fear: 0.237606 -Disgust: 0.228994 -Anger: 0.316378 | credit card Relevance: 0.703885 Sentiment: -0.558869 Emotions: -Sadness: 0.724752 -Joy: 0.124824 -Fear: 0.095344 -Disgust: 0.04535 -Anger: 0.09825 | |
world-class customers Relevance: 0.736487 Sentiment: 0.922231 Emotions: -Sadness: 0.239939 -Joy: 0.364821 -Fear: 0.112302 -Disgust: 0.038732 -Anger: 0.208574 | whispered word Relevance: 0.394947 Sentiment: 0 Emotions: -Sadness: 0.084248 -Joy: 0.310103 -Fear: 0.060499 -Disgust: 0.13849 -Anger: 0.190855 | sober realism Relevance: 0.700107 Sentiment: 0 Emotions: -Sadness: 0.194125 -Joy: 0.324705 -Fear: 0.089495 -Disgust: 0.043107 -Anger: 0.062661 | |
Stratoscale Symphony Relevance: 0.733062 Sentiment: 0.292848 Emotions: -Sadness: 0.131983 -Joy: 0.191748 -Fear: 0.08108 -Disgust: 0.07167 -Anger: 0.084732 | radiant maiden Relevance: 0.393484 Sentiment: 0.745542 Emotions: -Sadness: 0.098052 -Joy: 0.587861 -Fear: 0.032495 -Disgust: 0.004877 -Anger: 0.023884 | cartoon cats Relevance: 0.699942 Sentiment: 0.53669 Emotions: -Sadness: 0.081233 -Joy: 0.310396 -Fear: 0.007662 -Disgust: 0.533 -Anger: 0.103954 | |
numerous initiatives Relevance: 0.729417 Sentiment: 0 Emotions: -Sadness: 0.072441 -Joy: 0.10439 -Fear: 0.131526 -Disgust: 0.035598 -Anger: 0.056772 | thy memories Relevance: 0.392245 Sentiment: 0 Emotions: -Sadness: 0.462287 -Joy: 0.351459 -Fear: 0.040361 -Disgust: 0.011831 -Anger: 0.0264 | excellent medium Relevance: 0.699505 Sentiment: 0.53669 Emotions: -Sadness: 0.081233 -Joy: 0.310396 -Fear: 0.007662 -Disgust: 0.533 -Anger: 0.103954 | |
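As a rough sketch of how these relevance scores could feed back into the resume question from the introduction, the snippet below keeps only the keywords at or above a cutoff, highest first. The function name top_keywords() and the 0.9 cutoff are assumptions of mine, and the sample entries are four Sample #1 rows copied from the table above. In practice you would pass in $ibm_result['keywords'] instead.

```php
<?php
// Hypothetical helper for illustration; not part of the Watson NLU API.
// Keep keyword texts whose relevance meets the cutoff, highest first.
function top_keywords(array $keywords, float $min_relevance = 0.9): array
{
    $kept = array_filter($keywords, function ($k) use ($min_relevance) {
        return $k['relevance'] >= $min_relevance;
    });
    // Sort descending by relevance; usort also reindexes the array.
    usort($kept, function ($a, $b) {
        return $b['relevance'] <=> $a['relevance'];
    });
    return array_column($kept, 'text');
}

// Four Sample #1 keywords and their relevance scores from the table.
$sample = array(
    array('text' => 'HPE Helion', 'relevance' => 0.798603),
    array('text' => 'cloud platform APIs', 'relevance' => 0.995924),
    array('text' => 'legal compliance', 'relevance' => 0.741952),
    array('text' => 'Amazon Web Services', 'relevance' => 0.946893),
);

// Prints "cloud platform APIs" then "Amazon Web Services"
foreach (top_keywords($sample) as $text) {
    echo $text . PHP_EOL;
}
```

The cutoff is a judgment call: Sample #2 topped out around 0.9 with most keywords near 0.4, so a threshold that works for a bio may return almost nothing for a poem.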
I’d love to read your questions and comments on these services. How do you think you can use these in your business or project?