Overview

The Linguistic Analysis service of the Text Analysis Linguistic Analysis package is part of a series of enterprise-grade, natural-language products for gaining new insights from unstructured text sources. This service accepts documents, social media posts, email messages, search queries, and other forms of text, returning individual words and their parts of speech, stems, sentence and paragraph numbers, and other information that applications such as search engines and natural-language processors can use.

Some of the capabilities the service provides are:

  • Tokenization -- separating words and punctuation and discarding spaces: “card-based payment systems.” → “card” “based” “payment” “systems” “.”
  • Stemming -- returning the root or roots of a word: “ran”, “running”, “runs” → “run”; “Häuser” → “Haus”
  • Part-of-speech tagging -- indicating the sense or function of a word given its context: “quickly” = adverb; “Jeffrey” = proper noun
  • Noun group identification -- finding clusters of words around a noun that give more information about a person or a thing: “The mobile operating system is geared to process text data.” → “mobile operating system”, “text data”
  • Language identification -- determining the dominant language used in a document: “Je suis désolé, Dave. Je crains de ne pas pouvoir faire ça.” → “fr”

Use cases

This section describes a few potential uses of the Linguistic Analysis service.

Text search

Assume you are creating a search index for a set of documents in a variety of formats, such as plain text, PDF, and Microsoft Word or Excel, written in English, German, French, Italian, Portuguese, and Spanish. You apply the common practice of storing the stems of the words that appear in the documents.

Later, when a user enters a search query, you search for the stems of the words in the query. Storing and searching for stems finds documents even if the search query contains different forms of a word than appear in the document. For example, if the document contains the sentence "In 2016, Bernie Sanders ran for the presidency." and the search query is "Sanders' run for presidency", the document still matches the query.
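The stem-matching idea can be sketched as follows. The small stem table here is a hypothetical stand-in for the stems the service would return:

```python
# Hypothetical stem lookup standing in for the Linguistic Analysis service's output.
STEMS = {
    'ran': 'run', 'running': 'run', 'runs': 'run', 'run': 'run',
}

def stem_terms(words):
    """Map each word to its stem, falling back to the lowercased word itself."""
    return {STEMS.get(w.lower(), w.lower()) for w in words}

doc_terms = stem_terms(['Bernie', 'Sanders', 'ran', 'for', 'the', 'presidency'])
query_terms = stem_terms(['Sanders', 'run', 'for', 'presidency'])

# The document matches because both "ran" and "run" reduce to the stem "run".
assert query_terms <= doc_terms
```

Indexing and searching on stems is what makes "Sanders' run for presidency" find a document containing "Bernie Sanders ran for the presidency."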

In the response from this service, the token attribute contains the word as it originally appeared in the document and stems contains the word's stem(s). However, if the original word is a stem, stems is empty and normalizedToken contains the stem.

You pass your documents and user queries to the Linguistic Analysis service and use the value of the stems or normalizedToken attributes of the response, ignoring the token attribute.

Because this service can process plain text, PDF, Word, Excel, and other formats, you can pass all documents for indexing directly to the service, using a content-type of application/octet-stream in the request header, without converting them to text first. User queries are in plain text, so you pass them with a content-type of application/json.

Because this service supports all six of the languages in your collection, you can pass all documents and queries to the service and know it will return the correct stems.

Experimental information retrieval algorithms using HANA

You have a collection of documents on which you want to experiment with some interesting information retrieval algorithms. All of the algorithms require the following information for every word in the document collection:

  • its original form as it appears in the document
  • its stem(s)
  • its position, to measure how far apart words are from each other
  • its part of speech
  • its sentence number
  • its paragraph number

The collection might contain documents written in languages other than English, but you want to experiment only with those in English.

To experiment with your algorithms, you create a HANA database table containing the words in the document collection.

You open each document file, pass it to the Linguistic Analysis service, and check the language code to make sure it is "en", for English. If it is, you save each token, stems, normalizedToken, offset, partOfSpeech, sentence, and paragraph returned by the service, along with the document's filename, in your HANA database. When your database contains a sufficient number of documents, you begin your experimentation while continuing to grow your database in the background.
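The save step above can be sketched as follows, using Python's built-in sqlite3 module as a stand-in for a HANA connection (with SAP's hdbcli driver the DB-API calls are analogous). The table and column names are illustrative:

```python
import sqlite3  # stand-in here for a HANA DB-API connection such as hdbcli

conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE doc_tokens (
    filename TEXT, token TEXT, stems TEXT, normalizedToken TEXT,
    "offset" INTEGER, partOfSpeech TEXT, sentence INTEGER, paragraph INTEGER)''')

def store_tokens(filename, analysis):
    """Insert one row per token from a Linguistic Analysis response."""
    if analysis.get('language') != 'en':  # keep only English documents
        return
    rows = [(filename, t['token'], ','.join(t['stems']), t['normalizedToken'],
             t['offset'], t['partOfSpeech'], t['sentence'], t['paragraph'])
            for t in analysis['tokens']]
    conn.executemany('INSERT INTO doc_tokens VALUES (?, ?, ?, ?, ?, ?, ?, ?)', rows)
    conn.commit()

# A response fragment in the shape shown in the tutorials below.
store_tokens('example.txt', {
    'language': 'en',
    'tokens': [{'token': 'Running', 'stems': ['run'], 'normalizedToken': 'running',
                'offset': 0, 'partOfSpeech': 'verb', 'sentence': 1, 'paragraph': 1}],
})
```

Storing the stems as a comma-joined string is one simple choice; a normalized child table per stem would also work.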


This service can identify 33 languages: Arabic, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Farsi, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian (Bokmål), Norwegian (Nynorsk), Polish, Portuguese, Romanian, Russian, Serbian (Cyrillic), Serbian (Latin), Slovak, Slovenian, Spanish, Swedish, Thai, and Turkish.




The service accepts input in a wide variety of formats:

  • Adobe PDF
  • Generic email messages (.eml)
  • HTML
  • Microsoft Excel
  • Microsoft Outlook email messages (.msg)
  • Microsoft PowerPoint
  • Microsoft Word
  • Open Document Presentation
  • Open Document Spreadsheet
  • Open Document Text
  • Plain Text
  • Rich Text Format (RTF)
  • WordPerfect
  • XML

The size of each input file is limited to 1 MB.
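Because the limit applies per file, a client can check a file's size before uploading it. A minimal sketch (the helper name is illustrative, not part of the API):

```python
import os

MAX_INPUT_BYTES = 1 * 1024 * 1024  # the service's 1 MB per-file limit

def fits_input_limit(path):
    """Return True if the file at path is small enough to send to the service."""
    return os.path.getsize(path) <= MAX_INPUT_BYTES
```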

The tenant parameter is not required because the service is stateless and no data is persisted.


API Reference

POST /

Tokenize input and return the normalized form, part of speech, and stem of each token in addition to positional information. An in-depth description of linguistic analysis is available in the SAP HANA Text Analysis Language Reference Guide.

The partOfSpeech property can be any of the following values:

  • abbreviation
  • adjective
  • adverb
  • auxiliary verb
  • conjunction
  • determiner
  • interjection
  • noun
  • number
  • particle
  • preposition
  • pronoun
  • proper name
  • punctuation
  • verb
  • unknown

These parts of speech are simplified from finer grained values used internally within text analysis, as explained in the Note at the end of Structure of the $TA Table in the SAP HANA Text Analysis Developer Guide. The internal parts of speech of each language can be found in the Language-Specific Part-of-Speech Tagging Examples section of the SAP HANA Text Analysis Language Reference Guide.


The stems and normalizedToken members

The stems member is an array because a word can have more than one possible stem. For example, the word "driving" has the stem "drive" in the verb sense, as in "driving a car." As an adjective, however, as in "driving rain," the word "driving" is itself the stem. If the word appears in the context of a sentence, the service can usually determine which sense is in use. If there is no context, or the context is ambiguous, the service returns all possible stems.

When a word is used in an input document in its stem form, the stems member is empty and instead the stem appears in the normalizedToken member's value.
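This rule reduces to a one-line helper when selecting terms to index; the function name is illustrative, not part of the API:

```python
def index_forms(token_obj):
    """Return the form(s) to index for one token from the service's response.
    If stems is empty, the word already was a stem, so use normalizedToken."""
    return token_obj['stems'] or [token_obj['normalizedToken']]

# "driving" in the verb sense stems to "drive"; "drive" itself has no further stem.
assert index_forms({'stems': ['drive'], 'normalizedToken': 'driving'}) == ['drive']
assert index_forms({'stems': [], 'normalizedToken': 'drive'}) == ['drive']
```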


Default Language

The default language is either the first value listed in the languageCodes input parameter (see Setting a subset of languages in this topic) or English if the languageCodes input parameter is not specified.


Setting a subset of languages

You can instruct the service to choose from a specific, reduced set of languages by setting the languageCodes input parameter. This forces the service to choose one of the languages you supply.
Use this setting with caution. If, for example, you set languageCodes to Danish, German, and Dutch and the input text is in Russian, the service cannot return Russian; it returns the default language instead.
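When building the request URL, the comma-separated language list must be percent-encoded (a comma becomes %2C), as the tutorials below show. In Python, urllib can construct the query string; the Danish/German/Dutch restriction here is just an example:

```python
from urllib.parse import urlencode

# Hypothetical restriction to Danish, German, and Dutch.
query = urlencode({'languageCodes': ','.join(['da', 'de', 'nl'])})
service_url = 'https://api.beta.yaas.io/sap/ta-linguistics/v1?' + query
```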


Meaning of the textSize value

The returned textSize attribute represents the amount of character data in the input, not the number of bytes. If the input is a plain text file without accented characters, textSize equals the input file's size. However, if the input is a binary file such as a PDF or Microsoft Word document, textSize will probably be much smaller than the file size, especially if the file contains a lot of non-textual data such as an embedded image.
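The character/byte distinction is easy to see in Python:

```python
text = 'Je suis désolé, Dave.'
print(len(text))                  # 21 characters
print(len(text.encode('utf-8')))  # 23 bytes: each 'é' occupies two bytes in UTF-8
```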


Annotated JSON schema

The JSON schema contains the descriptions of the objects and members of the JSON response that the service returns. To read the schema, click the POST link in the API Reference then switch to the RESPONSE tab.


Further references

You can find extensive details on the capabilities and behavior of SAP's linguistic analysis technology in the Linguistic Analysis chapter of the SAP HANA Text Analysis Language Reference Guide.


Python Tutorials

Simple example with static data and default options

In this Python tutorial, you have some static text and you want to determine the part of speech for each word.

Get an access token

To use the service, you must pass an access token in each call. Get the token from the OAuth2 service.

import requests
import json

# Replace the two following values with your client secret and client ID.
client_secret = 'clientSecretPlaceholder'
client_id = 'clientIDPlaceholder'

s = requests.Session()

# Get the access token from the OAuth2 service.
auth_url = 'https://api.beta.yaas.io/hybris/oauth2/v1/token'
r = s.post(auth_url, data={'client_secret': client_secret, 'client_id': client_id, 'grant_type': 'client_credentials'})
access_token = r.json()['access_token']

Call the service

The POST request body includes a single value: the text upon which to perform linguistic analysis. You could load the contents of a binary file, web page, text stream, or other resource into the request. In this example, you use static text that you typed into your Python source code.

# The Linguistic Analysis service's URL
service_url = 'https://api.beta.yaas.io/sap/ta-linguistics/v1/'

# The example text whose parts of speech you want to see. 
service_text = 'Guten Morgen, Herr Veezner. Wie geht es? Haben Sie etwas zu verzollen?'

# HTTP request headers
req_headers = {}

# Set content-type to 'application/json' to pass plain text to the service. Specify the text's encoding as UTF-8.
req_headers['content-type'] = 'application/json; charset=utf-8'
req_headers['Cache-Control'] = 'no-cache'
req_headers['Connection'] = 'keep-alive'
req_headers['Accept-Encoding'] = 'gzip'
req_headers['Authorization'] = 'Bearer {}'.format(access_token)

# Make the REST call to the Linguistic Analysis service
response = s.post(url=service_url, headers=req_headers, data=json.dumps({'text': service_text}))


Display the returned JSON response on the console. The first 25 lines of output from this tutorial should be:

{ 
    "mimeType": "text/plain",  
    "tokens": [ 
        { 
            "normalizedToken": "guten",  
            "sentence": 1,  
            "stems": [ 
                "gut" 
            ],  
            "token": "Guten",  
            "paragraph": 1,  
            "partOfSpeech": "adjective",  
            "offset": 0 
        },  
        { 
            "normalizedToken": "morgen",  
            "sentence": 1,  
            "stems": [ 
                "Morgen" 
            ],  
            "token": "Morgen",  
            "paragraph": 1,  
            "partOfSpeech": "noun",  
            "offset": 6 
        },  

Each token in the input appears in the response, in the "tokens" array, in order of appearance. Every token has seven attributes:

  • normalizedToken
  • sentence
  • stems
  • token
  • paragraph
  • partOfSpeech - This is the attribute you are interested in for this tutorial.
  • offset

The JSON schema for the response contains a detailed description of each attribute. See the Details section of this service for a link to the schema.


The last few lines of the response are:

        {
            "normalizedToken": "?",
            "sentence": 3,
            "stems": [],
            "token": "?",
            "paragraph": 1,
            "partOfSpeech": "punctuation",
            "offset": 69
        }
    ],
    "textSize": 70,
    "language": "de"
}

The textSize attribute indicates the number of characters (70) in the input text, and the language attribute contains the ISO 639-1 code of the input text. The code for German is "de".

# Print result
if response.status_code == 200:
    # De-serialize the JSON reply and get the tokens list.
    response_dict = json.loads(response.text)
    response_tokens = response_dict.get('tokens', [])
    print('Input tokens and their parts of speech')
    for t in response_tokens:
        print('token: ' + t['token'] + '\t\tpart of speech: ' + t['partOfSpeech'])
else:
    print('Error', response.status_code)
    print(response.text)

Specify a list of languages in a binary document

In this tutorial, you know that the text must be in one of three languages: English, French, or Spanish. You have a Microsoft Word file, named French.docx, that contains the text "Je suis désolé, Dave. Je crains de ne pas pouvoir faire ça." You will print the part of speech of each token.

This example does not show the required steps to obtain an access token. See the first example in this tutorial for those instructions.

Read the contents of the Word file

filename = 'French.docx'
f = open(filename, 'rb')
service_binarydata = f.read()
f.close()

Identify the language as either English, Spanish, or French

This tutorial's POST request includes the languageCodes parameter to force the service to select from three languages.

# Append "?languageCodes=es%2Cen%2Cfr" to the Linguistic Analysis URL to restrict
# identification to Spanish ('es'), English ('en'), or French ('fr'). Separate each
# language with a comma, represented as '%2C' within URLs.
service_url = 'https://api.beta.yaas.io/sap/ta-linguistics/v1?languageCodes=es%2Cen%2Cfr'

# HTTP request headers
req_headers = {}
# This example uses a different content-type value to pass binary data.
req_headers['content-type'] = 'application/octet-stream'
req_headers['Cache-Control'] = 'no-cache'
req_headers['Connection'] = 'keep-alive'
req_headers['Accept-Encoding'] = 'gzip'
req_headers['Authorization'] = 'Bearer {}'.format(access_token)

# Make the REST call to the Linguistic Analysis service. Pass the binary data
# in raw form. Do not base64-encode the data.
response = s.post(url=service_url, headers=req_headers, data=service_binarydata)

Print each token and its part of speech

As in the first example in this tutorial, the JSON response includes attributes about the document and every token within it:

{
    "mimeType": "application/msword",
    "tokens": [
        {
            "normalizedToken": "je",
            "sentence": 1,
            "stems": [],
            "token": "Je",
            "paragraph": 1,
            "partOfSpeech": "pronoun",
            "offset": 0
        },
        {
            "normalizedToken": "suis",
            "sentence": 1,
            "stems": [
                "\u00eatre"
            ],
            "token": "suis",
            "paragraph": 1,
            "partOfSpeech": "auxiliary verb",
            "offset": 3
        },
        {
            "normalizedToken": "desole",

The mimeType attribute shows that the binary data was correctly identified as Microsoft Word.

The last few lines of the service's response are:

        {
            "normalizedToken": ".",
            "sentence": 2,
            "stems": [],
            "token": ".",
            "paragraph": 1,
            "partOfSpeech": "punctuation",
            "offset": 58
        }
    ],
    "textSize": 64,
    "language": "fr"
}


The value returned for textSize is 64, while the text "Je suis désolé, Dave. Je crains de ne pas pouvoir faire ça." contains 59 characters. The conversion of Microsoft Word files to a plain-text equivalent within the Text Analysis services accounts for this apparent discrepancy: the accented characters 'é' and 'ç' are represented internally by multi-byte sequences.

Rather than print everything in the response, print just the tokens and their parts of speech.

if response.status_code == 200:
    # De-serialize the JSON reply and get the tokens list.
    response_dict = json.loads(response.text)
    response_tokens = response_dict.get('tokens', [])
    print('Input tokens and their parts of speech')
    for t in response_tokens:
        print('token: ' + t['token'] + '\tpart of speech: ' + t['partOfSpeech'])
else:
    print('Error', response.status_code)
    print(response.text)

Create a search index

This final example is a small modification of the previous one. You are creating a search index for a set of documents, as described in the first of the use cases in this tutorial.

Your example search index application's Python interface contains a search_index object whose store_term method takes a document identifier, a string (the stem), and the string's offset into the document.
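The search_index object belongs to your own application and is not part of the service. A minimal in-memory stand-in with the same store_term signature might look like this:

```python
from collections import defaultdict

class SearchIndex:
    """Minimal in-memory stand-in: maps each term to (document, offset) postings."""
    def __init__(self):
        self.postings = defaultdict(list)

    def store_term(self, doc_id, term, offset):
        """Record that term occurs in document doc_id at the given offset."""
        self.postings[term].append((doc_id, offset))

search_index = SearchIndex()
search_index.store_term('French.docx', 'être', 3)
```

A production index would persist the postings and merge duplicates, but this is enough to run the example below.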

if response.status_code == 200:
    # De-serialize the JSON reply and get the tokens list.
    response_dict = json.loads(response.text)
    response_tokens = response_dict.get('tokens', [])
    for t in response_tokens:
        if t['partOfSpeech'] == 'punctuation':
            continue  # don't store punctuation in the index
        # The stems attribute is empty if the stem form is used in the document.
        # In that case, store the normalizedToken, which is the lowercase form
        # if the word was capitalized only because it began a sentence, or the
        # unaccented form if the word contained accented characters.
        if not t['stems']:
            search_index.store_term(filename, t['normalizedToken'], t['offset'])
        else:
            # A word can have more than one stem. For example, the stem for "driving"
            # in "the driving rain" is "driving", but in "I was driving"
            # the stem is "drive", so use a for loop.
            for s in t['stems']:
                search_index.store_term(filename, s, t['offset'])
else:
    print('Error', response.status_code)
    print(response.text)

