Overview

The Public Sector Fact Extraction service of the Text Analysis Fact Extraction package is part of a series of enterprise-grade, natural-language products that find domain-specific relationships between entities in input texts. The service identifies public events. It helps answer questions such as "What types of events are taking place in a particular area?" and "Do any of these activities represent a potential safety issue?"

The service can help identify potential risks by extracting the following entity and fact types:

  • Domain-specific entity types:
    • facilities, with subtypes such as airports, buildings, and plants
    • domestic and international geographic areas
    • geographic features, with subtypes such as boundaries, land or water
    • geographic coordinates, weapons materials
    • vehicles, including air, land, water, license plate, and VIN
    • weapons, such as biological, chemical, exploding, nuclear, projectile, and shooting
  • Public-sector fact types:
    • person alias
    • person appearance
    • person attributes
    • person relationships
    • spatial references

Use case

An example use of this service is an application that scans brief, unstructured field reports of military incidents, such as "27 JAN 2017 13:33 - IRAQI TROOPS CAPTURED MILITANTS OUTSIDE MOSUL." The application looks for patterns in behavior to predict possible future events. To make these kinds of projections, you need to identify date and time stamps, locations, actors, weaponry, and other data. You feed these reports to the Public Sector Fact Extraction service and save the entities it returns. For the example report above, the service returns the following:

| Text | Entity Type(s) | Hierarchical Position |
|------|----------------|-----------------------|
| 27 JAN 2017 13:33 - IRAQI TROOPS CAPTURED MILITANTS OUTSIDE MOSUL. | Action_Capture_Active | parent |
| 27 JAN 2017 | Date | child |
| 13:33 | Time | child |
| IRAQI TROOPS | Agent | child |
| MILITANTS | Patient | child |
| OUTSIDE MOSUL | SpatialReference_Vague | parent |
| OUTSIDE | Direction | child |
| MOSUL | Place | child |




The service supports text in English.

The service accepts input in a wide variety of formats:

  • Adobe PDF
  • Generic email messages (.eml)
  • HTML
  • Microsoft Excel
  • Microsoft Outlook email messages (.msg)
  • Microsoft PowerPoint
  • Microsoft Word
  • Open Document Presentation
  • Open Document Spreadsheet
  • Open Document Text
  • Plain Text
  • Rich Text Format (RTF)
  • WordPerfect
  • XML

The size of each input file is limited to 100 kB.
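
Because the limit applies to each input file, a client can check the size before calling the service. The following is a minimal sketch only. The helper name within_size_limit is hypothetical, and reading "100 kB" as 100,000 bytes is an assumption, not something this page confirms.

```python
# Assumption: "100 kB" is interpreted here as 100 * 1000 bytes.
# Adjust the constant if the service counts the limit differently.
MAX_INPUT_BYTES = 100 * 1000


def within_size_limit(data, limit=MAX_INPUT_BYTES):
    """Return True if the raw input (a bytes object) does not exceed the limit."""
    return len(data) <= limit
```

A caller would read the file in binary mode and pass the resulting bytes to this helper before sending the request.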

The tenant parameter is not required because the service is stateless and no data is persisted.


API Reference

POST /

Extracts public sector entities (for example, travel events, personal appearance attributes, and spatial references) from English input documents. For detailed definitions of the types of public sector entities that text analysis can identify, refer to the Public Sector Fact Extraction section of the SAP HANA Text Analysis Language Reference Guide.


An empty entities array is normal

It is not an error if the service returns an empty entities array. Not all text contains entities as defined and recognized by the service. For example, the English sentence "It's the end of the world as we know it, and I feel fine" contains no entities, nor do any of these translations:

  • Spanish: Es el fin del mundo tal como lo conocemos, y me siento bien.
  • German: Es ist das Ende der Welt, wie wir es kennen, und ich fühle mich gut.
  • Korean: 우리가 알고있는대로 그것은 세상의 종말이며, 나는 기분이 좋아집니다.
  • Russian: Это конец света, как мы его знаем, и я чувствую себя прекрасно.
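
Client code can treat an empty entities array as a normal result rather than a failure. The sketch below shows one way to do that. The helper name get_entities and the abbreviated sample response body are hypothetical and only illustrate the shape of the check.

```python
import json


def get_entities(response_text):
    """Return the entities list from a JSON response body, or [] if there are none."""
    response_dict = json.loads(response_text)
    # A missing or empty "entities" member is handled the same way: no entities.
    return response_dict.get('entities') or []


# Hypothetical, abbreviated response body for a text without entities.
sample_body = '{"entities": [], "language": "en"}'

entities = get_entities(sample_body)
if not entities:
    print('No entities found; this is not an error.')
```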


The label and labelPath members

Some of the entity types that the service identifies are given general categories and then subdivided into more specific classifications. For example, URI is a general category with the subtypes EMAIL, IP, and URL. The label member of the entities array is the most specific entity type of an extracted entity. The labelPath member includes the general category and the subtype, separated by a forward slash ("/").
That means, if a document contains the web address "http://www.sap.com" in its text, that string is extracted as an entity with its label attribute's value set to URL and its labelPath set to URI/URL.
If the entity type does not have subtypes, for example, PERSON, the label and labelPath values are identical.
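
The relationship between label and labelPath can be sketched in a few lines of Python. The helper names are hypothetical, and the PERSON entity text is a placeholder. The URL entity uses the values described above.

```python
def split_label_path(label_path):
    """Split a labelPath such as 'URI/URL' into its category levels."""
    return label_path.split('/')


def has_subtype(entity):
    """Return True when labelPath carries a general category in addition to the label."""
    return entity['labelPath'] != entity['label']


url_entity = {'text': 'http://www.sap.com', 'label': 'URL', 'labelPath': 'URI/URL'}
# Placeholder text; only the label and labelPath values matter here.
person_entity = {'text': 'placeholder name', 'label': 'PERSON', 'labelPath': 'PERSON'}

# The last level of labelPath is the most specific type, which is the label.
most_specific = split_label_path(url_entity['labelPath'])[-1]
```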


How parent and child entities are linked in the JSON response

Some entities are composed of other entities. This hierarchical relationship is indicated in the JSON response by an extra attribute, parent, that appears only on child entities. It contains an integer value that matches the id value of the parent.

The parent entity always appears before its children in the entities array. The order in which children appear is not guaranteed.
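
One way to reconstruct the hierarchy on the client is to group child entities by the value of their parent attribute. The sketch below assumes only the id and parent members described above. The function name group_children is hypothetical, and the sample entities are abbreviated.

```python
from collections import defaultdict


def group_children(entities):
    """Map each parent id to the list of entities that name it in 'parent'."""
    children = defaultdict(list)
    for entity in entities:
        # Only child entities carry the 'parent' attribute.
        if 'parent' in entity:
            children[entity['parent']].append(entity)
    return children


# Abbreviated entities: one parent (id 2) and one of its children (id 3).
sample_entities = [
    {'id': 1, 'label': 'DATE', 'text': '27 JAN 2017'},
    {'id': 2, 'label': 'Action_Capture_Active',
     'text': '27 JAN 2017 13:33 - IRAQI TROOPS CAPTURED MILITANTS OUTSIDE MOSUL.'},
    {'id': 3, 'label': 'Date', 'text': '27 JAN 2017', 'parent': 2},
]

children_by_parent = group_children(sample_entities)
```

Because the order of children is not guaranteed, this approach collects them by id instead of relying on their position in the array.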


Default Language

The default language is either the first value listed in the languageCodes input parameter (see Setting a subset of languages in this topic) or English if the languageCodes input parameter is not specified.


Setting a subset of languages

You can instruct the service to choose from a specific, reduced set of languages by setting the languageCodes input parameter. This forces the service to choose from one of the languages you supply.
Use this setting with caution. If, for example, you set languageCodes to Danish, German, and Dutch and the input text is in Russian, the service cannot return Russian. It must return the default language, which in this case is Danish, the first value listed.
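
The fallback rule can be modeled on the client side as follows. This is only an illustration of the documented behavior, not the service's implementation. The function names are hypothetical, and the two-letter language codes ('da', 'de', 'nl', 'ru') are an assumption based on the "en" value that appears in the response's language member.

```python
def default_language(language_codes=None):
    """Return the first entry of languageCodes, or English if none are given."""
    return language_codes[0] if language_codes else 'en'


def reported_language(actual_language, language_codes=None):
    """Model which language is reported for a text written in actual_language."""
    # Without a subset, or when the language is in the subset, nothing changes.
    if not language_codes or actual_language in language_codes:
        return actual_language
    # A language outside the subset cannot be returned; the default is used.
    return default_language(language_codes)


# Russian text with a Danish/German/Dutch subset falls back to the first code.
fallback = reported_language('ru', ['da', 'de', 'nl'])
```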


Meaning of the textSize value

The returned attribute textSize represents the amount of character data in the input, not the number of bytes. If the input is a plain text file without accented characters, textSize equals the input file's size. However, if the input is a binary file such as a PDF or Microsoft Word document, textSize will probably be much smaller than the file size, especially if the file contains a lot of non-textual data such as embedded images.
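
The difference between a character count and a byte count is easy to see in Python. The sketch below assumes UTF-8 encoding and counts Unicode code points. Exactly how the service counts characters is not specified here, so treat this as an illustration of the concept only.

```python
def char_and_byte_sizes(text):
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode('utf-8'))


# Unaccented text: one byte per character in UTF-8.
plain_sizes = char_and_byte_sizes(u'IRAQI TROOPS')

# Accented text: the character count is smaller than the byte count.
accented_sizes = char_and_byte_sizes(u'f\u00fchle')  # "fühle"
```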


Annotated JSON schema

The JSON schema contains the descriptions of the objects and members of the JSON response that the service returns. To read the schema, click the POST link in the API Reference then switch to the RESPONSE tab.


Further references

You can find extensive details on the capabilities and behavior of SAP's public sector fact extraction technology in the Public Sector Fact Extraction chapter of the SAP HANA Text Analysis Language Reference Guide (PDF).

For a list of all public sector entity types that the service returns, see the Entity Type Names and Subentity Type Names For Fact Extraction table of the SAP HANA Text Analysis Language Reference Guide. Look for entities whose Family column starts with "PublicSector".


Python Tutorial

This tutorial mirrors the use case described in the Overview. It shows how to retrieve and store entities, but it does not illustrate the analysis phase of the application described in the use case.

Analyze military field reports

For this tutorial, you are using the Public Sector Fact Extraction service to extract aspects of events described in brief, unstructured military field reports, which you will analyze in an attempt to predict the time and location of future events.

Get an access token

To use the service, you must pass an access token in each call. Get the token from the OAuth2 service.

import requests
import json

# Replace the two following values with your client secret and client ID.
client_secret = 'clientSecretPlaceholder'
client_id = 'clientIDPlaceholder'

s = requests.Session()

# Get the access token from the OAuth2 service.
auth_url = 'https://api.beta.yaas.io/hybris/oauth2/v1/token'
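r = s.post(auth_url, data={
    'client_secret': client_secret,
    'client_id': client_id,
    'grant_type': 'client_credentials',
})
# Stop here if the token request failed, instead of failing on a missing key below.
r.raise_for_status()
access_token = r.json()['access_token']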

Call the service

The POST request body for this service includes a single value: the text upon which to perform public sector fact extraction. Your variable field_report is an individual plain text field report.

# The Public Sector Fact Extraction service's URL
service_url = 'https://api.beta.yaas.io/sap/ta-publicsectorfacts/v1/'

# HTTP request headers
req_headers = {}

# Set content-type to 'application/json' to pass a plain text document to the service. Specify the plain text's encoding as UTF-8.
req_headers['content-type'] = 'application/json; charset=utf-8'
req_headers['Cache-Control'] = 'no-cache'
req_headers['Connection'] = 'keep-alive'
req_headers['Accept-Encoding'] = 'gzip'
req_headers['Authorization'] = 'Bearer {}'.format(access_token)

# Make the REST call to the Public Sector Fact Extraction service. 
response = s.post(url = service_url,  headers = req_headers, data = json.dumps({'text':field_report}))


Taking the sample text from the use case presented in the Overview, 27 JAN 2017 13:33 - IRAQI TROOPS CAPTURED MILITANTS OUTSIDE MOSUL., the JSON response from the service starts with:

{
    "mimeType": "text/plain", 
    "entities": [
        {
            "sentence": 1, 
            "text": "27 JAN 2017", 
            "label": "DATE", 
            "paragraph": 1, 
            "offset": 0, 
            "normalizedForm": "", 
            "id": 1, 
            "labelPath": "DATE"
        }, 
        {
            "sentence": 1, 
            "text": "27 JAN 2017 13:33 - IRAQI TROOPS CAPTURED MILITANTS OUTSIDE MOSUL.", 
            "label": "Action_Capture_Active", 
            "paragraph": 1, 
            "offset": 0, 
            "normalizedForm": "", 
            "id": 2, 
            "labelPath": "Action_Capture_Active"
        }, 
        {
            "parent": 2, 
            "sentence": 1, 
            "text": "27 JAN 2017", 
            "label": "Date", 
            "paragraph": 1, 
            "offset": 0, 
            "normalizedForm": "", 
            "id": 3, 
            "labelPath": "Date"
        }, 

The last few lines of the response are:

        {
            "parent": 10, 
            "sentence": 1, 
            "text": "MOSUL", 
            "label": "Place", 
            "paragraph": 1, 
            "offset": 60, 
            "normalizedForm": "", 
            "id": 13, 
            "labelPath": "Place"
        }, 
        {
            "sentence": 1, 
            "text": "MOSUL", 
            "label": "LOCALITY", 
            "paragraph": 1, 
            "offset": 60, 
            "normalizedForm": "", 
            "id": 14, 
            "labelPath": "LOCALITY"
        }
    ], 
    "textSize": 67, 
    "language": "en"
}

Each entity extracted from the input text appears in the response's "entities" array in order of appearance. Public sector entities are more complex than those returned by other text analysis services. They can have multiple layers of hierarchy, as in a parent←→child relationship. This diagram illustrates how the entities in this tutorial's example text are related to one another:

  • Action_Capture_Active: 27 JAN 2017 13:33 - IRAQI TROOPS CAPTURED MILITANTS OUTSIDE MOSUL.
    • Date: 27 JAN 2017
    • Time: 13:33
    • Agent: IRAQI TROOPS
    • Patient: MILITANTS
    • Place: MOSUL
  • SpatialReference_Vague: OUTSIDE MOSUL
    • Direction: OUTSIDE
    • Place: MOSUL (the same Place node that is linked from Action_Capture_Active)

Every entity returned has at least eight attributes:

  • sentence
  • text
  • label
  • paragraph
  • offset
  • normalizedForm
  • id
  • labelPath

For a detailed description of each attribute in the response, see the link to the JSON schema in the Details section of this service.

Some entities have a ninth attribute, parent, which associates child entities with the parents they comprise.

Suppose you care only about action-oriented entities, that is, those whose label attribute starts with Action_. See the Details section for a link to a list of all Public Sector entity types.

Given the example text, you store the Action_Capture_Active entity and its children, but you do not store the SpatialReference_Vague entity.

Your application needs the parent/child relationships between entities to perform its analysis. The ActionEntity is a class defined in your example code that contains one action entity and its children.
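
The tutorial does not show the definition of ActionEntity or of the store_action function that the processing code calls. The following is a hypothetical minimal sketch that is consistent with how they are used: a constructor that takes three positional arguments, an id attribute, and an add_child method. The in-memory list stands in for whatever storage your application uses.

```python
class ActionEntity(object):
    """One action-oriented entity together with its child entities."""

    def __init__(self, text, label, entity_id):
        self.text = text          # the text span of the action
        self.label = label        # the entity type, for example Action_Capture_Active
        self.id = entity_id       # the id that children reference in 'parent'
        self.children = []        # (text, label) pairs of child entities

    def add_child(self, text, label):
        """Attach a child entity to this action."""
        self.children.append((text, label))


# Hypothetical storage: a real application would write to a database or file.
stored_actions = []


def store_action(action_entity):
    """Save a completed action entity for later analysis."""
    stored_actions.append(action_entity)
```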

# Process the result
if response.status_code == 200:
    # De-serialize the JSON reply and get the entities list.
    response_dict = json.loads(response.text)
    # If the service returns no entities, it is not an error; it just means
    # no public sector entities are contained in the text. Default to an
    # empty list so that the loop below is simply skipped.
    entities = response_dict.get('entities') or []

    action_entity = None  # the action currently being assembled, if any

    for e in entities:
        entity_type = e.get('label', '')
        # At an action-oriented entity (an entity whose type begins with
        # "Action_")? They always precede their children in the entity array.
        if entity_type.startswith('Action_'):
            # If already encountered another action-oriented entity in the
            # array, now's the time to save it.
            if action_entity is not None:
                store_action(action_entity)

            # Arguments: text, label (entity type), id.
            action_entity = ActionEntity(e.get('text'), entity_type,
                e.get('id'))
        elif action_entity is not None and e.get('parent') == action_entity.id:
            # This is a child entity of the current action. Tack it on.
            action_entity.add_child(e.get('text'), e.get('label'))

    # End of entities array. If found any, the last one still needs to be
    # stored.
    if action_entity is not None:
        store_action(action_entity)
else:
    print('Error {}'.format(response.status_code))
    print(response.text)

