Entity Extraction: HTML API

AlchemyAPI provides easy-to-use facilities for extracting semantic information from any web page: post (upload) any content directly to our service for analysis.

Posted content is normalized / cleaned (removing ads, navigation links, and other unimportant content), the primary document language is detected, and named entities, topic keywords, and other content are extracted automatically.

These API calls process posted (uploaded) web pages and other HTML content. If you are processing content hosted on a publicly accessible website, consider using our URL processing calls instead.

HTMLGetRankedNamedEntities - Extract a grouped, relevancy-ranked list of named entities from a web page.
HTMLGetNamedEntities - Extract a non-grouped, non-ranked list of all named entities from a web page.
HTMLGetAnnotatedNamedEntityText - Annotate text based on the named entities detected within a web page.

API Call: HTMLGetRankedNamedEntities

Description: The HTMLGetRankedNamedEntities call extracts a grouped, relevancy-ranked list of named entities (people, companies, organizations, etc.) from a posted HTML document. AlchemyGrid extracts text from the posted HTML document (ignoring navigation links, advertisements, and other undesirable content) and performs entity extraction on it.

Endpoint: http://access.alchemyapi.com/calls/html/HTMLGetRankedNamedEntities

Parameters:

http argument - parameter description

apikey - your private api key
    (required parameter)

html - HTML document content (must be uri-argument encoded)
    (required parameter)

url - HTML document URL
    (optional parameter, must be uri-argument encoded)

outputMode - desired API output format
    Possible values:
        xml (default)
        json
        rdf
        rel-tag
        rel-tag-raw
    (optional parameter)

disambiguate - whether to disambiguate detected entities.
    Possible values:
        1 - enabled (default)
        0 - disabled
    (optional parameter)

linkedData - whether to include Linked Data content links with disambiguated entities.
    Possible values:
        1 - enabled (default)
        0 - disabled
    (optional parameter. disambiguation must be enabled to utilize the linkedData feature.)

coreference - whether to resolve he/she/etc coreferences into detected entities.
    Possible values:
        1 - enabled (default)
        0 - disabled
    (optional parameter)

quotations - whether to enable quotations extraction.
    Possible values:
        1 - enabled
        0 - disabled (default)
    (optional parameter)

showSourceText - whether to include the original 'source text' the entities were extracted from within the API response.
    Possible values:
        1 - enabled
        0 - disabled (default)
    (optional parameter)

sourceText - where to obtain the text that will be processed by this API call.
    AlchemyAPI supports multiple modes of text extraction: web page cleaning (removes ads, navigation links, etc.), raw text extraction (processes all web page text, including ads / nav links), visual constraint queries, and XPath queries.
    Possible values:
        cleaned_or_raw - cleaning enabled, fallback to raw when cleaning produces no text (default)
        cleaned - operate on 'cleaned' web page text (web page cleaning enabled)
        raw - operate on raw web page text (web page cleaning disabled)
        cquery - operate on the results of a visual constraints query
            Note: The 'cquery' http argument must also be set to a valid visual constraints query.
        xpath - operate on the results of an XPath query
            Note: The 'xpath' http argument must also be set to a valid XPath query.
    (optional parameter)

cquery - a visual constraints query to apply to the web page.
    Constraint queries enable API operations to be performed on a targeted area of a web page, such as a story title or product description.
    (optional parameter, used when sourceText is set to 'cquery'. must be uri-argument encoded)

xpath - an XPath query to apply to the web page.
    XPath queries enable API operations to be performed on a targeted area of a web page, such as a story title or product description.
    (optional parameter, used when sourceText is set to 'xpath'. must be uri-argument encoded)

maxRetrieve - maximum number of named entities to extract (default: no limit)
    (optional parameter)

baseUrl - rel-tag output base http url
    (optional parameter, used with rel-tag or rel-tag-raw outputMode. must be uri-argument encoded)
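
As an illustration of how the arguments above fit together, the following Python sketch builds a form-encoded request body for this call. The helper name build_ranked_entities_body, the placeholder API key, and the sample HTML and URL are assumptions made for the example; only the argument names and the endpoint come from this page. Nothing is sent over the network here.

```python
from urllib.parse import urlencode

ENDPOINT = "http://access.alchemyapi.com/calls/html/HTMLGetRankedNamedEntities"


def build_ranked_entities_body(api_key, html, url=None, **options):
    """Return a form-encoded POST body (hypothetical helper, not part of any SDK)."""
    params = {"apikey": api_key, "html": html}
    if url is not None:
        params["url"] = url
    # Optional arguments such as outputMode="json" or maxRetrieve=10.
    params.update(options)
    # urlencode percent-encodes every value, which is how this sketch
    # satisfies the "must be uri-argument encoded" requirement.
    return urlencode(params).encode("ascii")


body = build_ranked_entities_body(
    "YOUR_API_KEY",  # placeholder, replace with your private api key
    "<html><body><p>Ada Lovelace worked with Charles Babbage.</p></body></html>",
    url="http://www.example.com/page.html",  # made-up document URL
    outputMode="json",
    quotations=1,
)
print(body.decode("ascii"))
```

The resulting bytes can be used as the body of an HTTP POST to the endpoint listed above.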

Response Format (XML):

<results>
    <status>REQUEST_STATUS</status>
    <language>DOCUMENT_LANGUAGE</language>
    <url>DOCUMENT_URL</url>
    <text>DOCUMENT_TEXT</text>
    <entities>
        <entity>
            <type>DETECTED_TYPE</type>
            <relevance>DETECTED_RELEVANCE</relevance>
            <count>DETECTED_COUNT</count>
            <text>DETECTED_ENTITY</text>
            <disambiguated>
                <name>DISAMBIGUATED_ENTITY</name>
                <subType>ENTITY_SUBTYPE</subType>
                <website>WEBSITE</website>
                <geo>LATITUDE LONGITUDE</geo>
                <dbpedia>LINKED_DATA_DBPEDIA</dbpedia>
                <yago>LINKED_DATA_YAGO</yago>
                <opencyc>LINKED_DATA_OPENCYC</opencyc>
                <umbel>LINKED_DATA_UMBEL</umbel>
                <freebase>LINKED_DATA_FREEBASE</freebase>
                <ciaFactbook>LINKED_DATA_FACTBOOK</ciaFactbook>
                <census>LINKED_DATA_CENSUS</census>
                <geonames>LINKED_DATA_GEONAMES</geonames>
                <musicBrainz>LINKED_DATA_MUSICBRAINZ</musicBrainz>
                <crunchbase>CRUNCHBASE_WEB_LINK</crunchbase>
                <semanticCrunchbase>LINKED_DATA_CRUNCHBASE</semanticCrunchbase>
            </disambiguated>
            <quotations>
                <quotation>ENTITY_QUOTATION</quotation>
            </quotations>
        </entity>
    </entities>
</results>
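
A response in this XML format can be read with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on an invented sample that follows the element names in the template above; the sample values and the function name parse_ranked_entities are assumptions for illustration, not real service output.

```python
import xml.etree.ElementTree as ET

# Invented sample that follows the element names of the XML template.
SAMPLE_XML = """<results>
    <status>OK</status>
    <language>english</language>
    <entities>
        <entity>
            <type>Person</type>
            <relevance>0.91</relevance>
            <count>2</count>
            <text>Ada Lovelace</text>
            <disambiguated>
                <name>Ada Lovelace</name>
                <dbpedia>http://dbpedia.org/resource/Ada_Lovelace</dbpedia>
            </disambiguated>
        </entity>
        <entity>
            <type>City</type>
            <relevance>0.33</relevance>
            <count>1</count>
            <text>London</text>
        </entity>
    </entities>
</results>"""


def parse_ranked_entities(xml_text):
    """Return one dict per <entity>, or raise if the status is not OK."""
    root = ET.fromstring(xml_text)
    if root.findtext("status") != "OK":
        raise RuntimeError(root.findtext("statusInfo") or "request failed")
    entities = []
    for node in root.iterfind("entities/entity"):
        entities.append({
            "type": node.findtext("type"),
            "relevance": float(node.findtext("relevance")),
            "count": int(node.findtext("count")),
            "text": node.findtext("text"),
            # None when no <disambiguated> block was sent for this entity.
            "dbpedia": node.findtext("disambiguated/dbpedia"),
        })
    return entities


entities = parse_ranked_entities(SAMPLE_XML)
for entity in entities:
    print(entity["text"], entity["type"], entity["relevance"])
```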

Response Format (JSON):

{
    "status": "REQUEST_STATUS",
    "language": "DOCUMENT_LANGUAGE",
    "url": "DOCUMENT_URL",
    "text": "DOCUMENT_TEXT",
    "entities": [
        {
            "type": "DETECTED_TYPE",
            "relevance": "DETECTED_RELEVANCE",
            "count": "DETECTED_COUNT",
            "text": "DETECTED_ENTITY",
            "disambiguated": {
                "name": "DISAMBIGUATED_ENTITY",
                "subType": "ENTITY_SUBTYPE",
                "website": "WEBSITE",
                "geo": "LATITUDE LONGITUDE",
                "dbpedia": "LINKED_DATA_DBPEDIA",
                "yago": "LINKED_DATA_YAGO",
                "opencyc": "LINKED_DATA_OPENCYC",
                "umbel": "LINKED_DATA_UMBEL",
                "freebase": "LINKED_DATA_FREEBASE",
                "ciaFactbook": "LINKED_DATA_FACTBOOK",
                "census": "LINKED_DATA_CENSUS",
                "geonames": "LINKED_DATA_GEONAMES",
                "musicBrainz": "LINKED_DATA_MUSICBRAINZ",
                "crunchbase": "CRUNCHBASE_WEB_LINK",
                "semanticCrunchbase": "LINKED_DATA_CRUNCHBASE"
            },
            "quotations": [
                {
                    "quotation": "ENTITY_QUOTATION"
                }
            ]
        }
    ]
}
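
The JSON form lends itself to simple filtering by relevance. The following Python sketch assumes that "entities" is an array of objects and that numeric fields arrive as quoted strings, as the template suggests; the sample document, the 0.5 threshold, and the function name top_entities are made up for the example.

```python
import json

# Invented sample; every value is a string, mirroring the quoted placeholders.
SAMPLE_JSON = """
{
    "status": "OK",
    "language": "english",
    "entities": [
        {"type": "City", "relevance": "0.33", "count": "1", "text": "London"},
        {"type": "Person", "relevance": "0.91", "count": "2", "text": "Ada Lovelace"},
        {"type": "Person", "relevance": "0.62", "count": "1", "text": "Charles Babbage"}
    ]
}
"""


def top_entities(response_text, min_relevance=0.5):
    """Return entities at or above min_relevance, most relevant first."""
    data = json.loads(response_text)
    if data.get("status") != "OK":
        raise RuntimeError(data.get("statusInfo", "request failed"))
    picked = [
        entity for entity in data.get("entities", [])
        if float(entity["relevance"]) >= min_relevance
    ]
    return sorted(picked, key=lambda e: float(e["relevance"]), reverse=True)


names = [entity["text"] for entity in top_entities(SAMPLE_JSON)]
print(names)
```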

Response Format (RDF):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
                 xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
    <rdf:Description rdf:ID="DOCUMENT_HASH">
        <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
        <aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
        <aapi:Language>DOCUMENT_LANGUAGE</aapi:Language>
        <aapi:URL>DOCUMENT_URL</aapi:URL>
        <aapi:DocText>DOCUMENT_TEXT</aapi:DocText>
    </rdf:Description>
    <rdf:Description rdf:ID="DOCUMENT_HASH-ENTITY_NUM">
        <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#EntityOccurrences"/>
        <aapi:Doc>DOCUMENT_HASH</aapi:Doc>
        <aapi:EntityType>DETECTED_TYPE</aapi:EntityType>
        <aapi:Relevance>DETECTED_RELEVANCE</aapi:Relevance>
        <aapi:NumOccurs>DETECTED_COUNT</aapi:NumOccurs>
        <aapi:Name>DETECTED_ENTITY</aapi:Name>
        <aapi:Disambiguation>
            <rdf:Description rdf:about="#DOCUMENT_HASH-ENTITY_NUM">
                <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#Disambiguation"/>
                <aapi:Doc>DOCUMENT_HASH</aapi:Doc>
                <aapi:ResolvedName>DISAMBIGUATED_ENTITY</aapi:ResolvedName>
                <aapi:SubType>ENTITY_SUBTYPE</aapi:SubType>
                <aapi:URL>WEBSITE</aapi:URL>
                <aapi:Geo>LATITUDE LONGITUDE</aapi:Geo>
                <owl:sameAs rdf:resource="LINKED_DATA_DBPEDIA"/>
                <owl:sameAs rdf:resource="LINKED_DATA_YAGO"/>
                <owl:sameAs rdf:resource="LINKED_DATA_OPENCYC"/>
                <owl:sameAs rdf:resource="LINKED_DATA_UMBEL"/>
                <owl:sameAs rdf:resource="LINKED_DATA_FREEBASE"/>
                <owl:sameAs rdf:resource="LINKED_DATA_FACTBOOK"/>
                <owl:sameAs rdf:resource="LINKED_DATA_CENSUS"/>
                <owl:sameAs rdf:resource="LINKED_DATA_GEONAMES"/>
                <owl:sameAs rdf:resource="LINKED_DATA_MUSICBRAINZ"/>
                <owl:sameAs rdf:resource="LINKED_DATA_CRUNCHBASE"/>
            </rdf:Description>
        </aapi:Disambiguation>
        <aapi:Quotations>
            <rdf:Description rdf:about="#DOCUMENT_HASH-ENTITY_NUM">
                <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#Quotations"/>
                <aapi:Doc>DOCUMENT_HASH</aapi:Doc>
                <aapi:Quotation>ENTITY_QUOTATION</aapi:Quotation>
            </rdf:Description>
        </aapi:Quotations>
    </rdf:Description>
</rdf:RDF>

Response Format (REL-TAG Microformat [XML-embedded] ):

<results>
    <status>REQUEST_STATUS</status>
    <language>DOCUMENT_LANGUAGE</language>
    <url>REQUESTED_URL</url>
    <text>DOCUMENT_TEXT</text>
    <microformats>
        <a href="REQUESTED_BASE_URL/DETECTED_ENTITY" rel="tag">DETECTED_ENTITY</a>
        <a href="REQUESTED_BASE_URL/DETECTED_ENTITY" rel="tag">DETECTED_ENTITY</a>
    </microformats>
</results>

Response Format (REL-TAG Microformat [raw] ):

<a href="REQUESTED_BASE_URL/DETECTED_ENTITY" rel="tag">DETECTED_ENTITY</a>
<a href="REQUESTED_BASE_URL/DETECTED_ENTITY" rel="tag">DETECTED_ENTITY</a>
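
Raw rel-tag output is a sequence of anchor elements, so it can be read with an HTML parser. This Python sketch collects (href, text) pairs using the standard html.parser module. The class name RelTagParser and the sample links are assumptions; in particular, how the service encodes the entity inside the href is not stated on this page.

```python
from html.parser import HTMLParser


class RelTagParser(HTMLParser):
    """Collect (href, link text) pairs from <a rel="tag"> elements."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("rel") == "tag":
            self._href = attrs.get("href", "")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a rel="tag" anchor.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.tags.append((self._href, "".join(self._text)))
            self._href = None


# Invented sample output, with baseUrl assumed to be http://www.example.com/tags
sample = (
    '<a href="http://www.example.com/tags/Ada%20Lovelace" rel="tag">Ada Lovelace</a>\n'
    '<a href="http://www.example.com/tags/London" rel="tag">London</a>\n'
)

parser = RelTagParser()
parser.feed(sample)
print(parser.tags)
```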

Response Fields:

field name - field description

status - success / failure status indicating whether the request was processed.
    Possible values:
        OK
        ERROR

language - detected language (english, french, ..) for the submitted content.

url - http url information was requested for.

type - the detected entity type.

relevance - relevance score for a detected entity.
    Possible values: (0.0 - 1.0)   [1.0 = most relevant]

count - number of times an entity was seen within the source web page.

text - the detected entity text.

disambiguated - disambiguation information for the detected entity (sent only if disambiguation occurred)

    disambiguation field - field description

    name - the disambiguated entity name.
    subType - the disambiguated entity subType
        SubTypes expose additional ontological mappings for a detected entity, such as identification of a Person as a Politician or Athlete.
    website - the disambiguated entity website.
    geo - latitude longitude
        the disambiguated entity geographic coordinates.
    dbpedia - sameAs link to DBpedia for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    yago - sameAs link to YAGO for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    opencyc - sameAs link to OpenCyc for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    umbel - sameAs link to UMBEL for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    freebase - sameAs link to Freebase for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    ciaFactbook - sameAs link to the CIA World Factbook for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    census - sameAs link to the US Census for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    geonames - sameAs link to Geonames for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    musicBrainz - sameAs link to MusicBrainz for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    crunchbase - website link to CrunchBase for the disambiguated entity.
        Note: Provided only for entities that exist in CrunchBase.
    semanticCrunchbase - sameAs link to Semantic CrunchBase for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.

quotations - extracted quotations for the detected entity (sent only if quotations extraction is enabled)

    field - field description

    quotation - quotation extracted for a particular named entity.

statusInfo - failure status information (sent only if "status" == "ERROR").
    Possible values:
        invalid-api-key
        page-is-not-html

API Notes:

  1. Calls to HTMLGetRankedNamedEntities should be made using HTTP POST.
  2. HTTP POST calls should include the Content-Type header: application/x-www-form-urlencoded
  3. Posted HTML documents can be a maximum of 600 kilobytes. Larger documents will result in a "content-exceeds-size-limit" error response.
  4. Language detection is performed on the posted document before attempting named entity extraction. The document must contain at least 15 characters of text for language detection to be performed.
  5. Documents containing fewer than 15 characters of text are assumed to be English-language content.
  6. Disambiguation of detected entities is enabled by default. Disambiguation information will be included for each entity that is successfully resolved.
  7. Entity extraction is currently supported for all languages listed on the language support page. Submissions in other languages will be rejected and an error response returned.
  8. Disambiguation and quotations extraction are currently available for English-language content only. Support for other languages is in development.
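
The first three notes can be checked on the client before a request is sent. The Python sketch below prepares (but does not send) a POST request with the required Content-Type header and refuses documents over the size limit. The helper name make_request is hypothetical, and two details are assumptions: that "600 kilobytes" means 600 * 1024 bytes, and that the limit applies to the HTML before encoding.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://access.alchemyapi.com/calls/html/HTMLGetRankedNamedEntities"
MAX_HTML_BYTES = 600 * 1024  # assumption: 1 kilobyte = 1024 bytes


def make_request(endpoint, api_key, html):
    """Build a POST request object; nothing is sent until it is opened."""
    # Assumption: the limit is measured on the raw (unencoded) UTF-8 document.
    if len(html.encode("utf-8")) > MAX_HTML_BYTES:
        raise ValueError("HTML document exceeds the 600 kilobyte limit")
    body = urlencode({"apikey": api_key, "html": html, "outputMode": "json"})
    return Request(
        endpoint,
        data=body.encode("ascii"),
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


req = make_request(ENDPOINT, "YOUR_API_KEY", "<html><body>Hello</body></html>")
print(req.get_method(), req.full_url)
# Passing req to urllib.request.urlopen would perform the call.
```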


API Call: HTMLGetNamedEntities

Description: The HTMLGetNamedEntities call extracts named entities (people, companies, organizations, etc.) from a posted HTML document. AlchemyGrid extracts text from the posted HTML document (ignoring navigation links, advertisements, and other undesirable content) and performs entity extraction on it.

Endpoint: http://access.alchemyapi.com/calls/html/HTMLGetNamedEntities

Parameters:

http argument - parameter description

apikey - your private api key
    (required parameter)

html - HTML document content (must be uri-argument encoded)
    (required parameter)

url - HTML document URL
    (optional parameter, must be uri-argument encoded)

outputMode - desired API output format
    Possible values:
        xml (default)
        json
        rdf
    (optional parameter)

disambiguate - whether to disambiguate detected entities.
    Possible values:
        1 - enabled (default)
        0 - disabled
    (optional parameter)

linkedData - whether to include Linked Data content links with disambiguated entities.
    Possible values:
        1 - enabled (default)
        0 - disabled
    (optional parameter. disambiguation must be enabled to utilize the linkedData feature.)

coreference - whether to resolve he/she/etc coreferences into detected entities.
    Possible values:
        1 - enabled (default)
        0 - disabled
    (optional parameter)

quotations - whether to enable quotations extraction.
    Possible values:
        1 - enabled
        0 - disabled (default)
    (optional parameter)

showSourceText - whether to include the original 'source text' the entities were extracted from within the API response.
    Possible values:
        1 - enabled
        0 - disabled (default)
    (optional parameter)

sourceText - where to obtain the text that will be processed by this API call.
    AlchemyAPI supports multiple modes of text extraction: web page cleaning (removes ads, navigation links, etc.), raw text extraction (processes all web page text, including ads / nav links), visual constraint queries, and XPath queries.
    Possible values:
        cleaned_or_raw - cleaning enabled, fallback to raw when cleaning produces no text (default)
        cleaned - operate on 'cleaned' web page text (web page cleaning enabled)
        raw - operate on raw web page text (web page cleaning disabled)
        cquery - operate on the results of a visual constraints query
            Note: The 'cquery' http argument must also be set to a valid visual constraints query.
        xpath - operate on the results of an XPath query
            Note: The 'xpath' http argument must also be set to a valid XPath query.
    (optional parameter)

cquery - a visual constraints query to apply to the web page.
    Constraint queries enable API operations to be performed on a targeted area of a web page, such as a story title or product description.
    (optional parameter, used when sourceText is set to 'cquery'. must be uri-argument encoded)

xpath - an XPath query to apply to the web page.
    XPath queries enable API operations to be performed on a targeted area of a web page, such as a story title or product description.
    (optional parameter, used when sourceText is set to 'xpath'. must be uri-argument encoded)
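
The sourceText, cquery, and xpath arguments depend on one another: choosing 'cquery' or 'xpath' as the source requires the matching query argument. The Python sketch below enforces that rule on the client before encoding the request body. The function name build_body and the sample XPath expression are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode

SOURCE_TEXT_MODES = {"cleaned_or_raw", "cleaned", "raw", "cquery", "xpath"}


def build_body(api_key, html, source_text="cleaned_or_raw", cquery=None, xpath=None):
    """Encode a request body, checking the sourceText / query pairing first."""
    if source_text not in SOURCE_TEXT_MODES:
        raise ValueError("unknown sourceText value: %r" % source_text)
    params = {"apikey": api_key, "html": html, "sourceText": source_text}
    if source_text == "cquery":
        if not cquery:
            raise ValueError("sourceText=cquery requires the cquery argument")
        params["cquery"] = cquery
    elif source_text == "xpath":
        if not xpath:
            raise ValueError("sourceText=xpath requires the xpath argument")
        params["xpath"] = xpath
    return urlencode(params)


body = build_body(
    "YOUR_API_KEY",  # placeholder
    "<html><body><div id='story'><h1>Title</h1></div></body></html>",
    source_text="xpath",
    xpath="//div[@id='story']/h1",  # hypothetical query
)
print(body)
```

Decoding the body again (for example with parse_qs) returns the original XPath expression, which is a quick way to confirm that the encoding step did not alter it.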

Response Format (XML):

<results>
    <status>REQUEST_STATUS</status>
    <language>DOCUMENT_LANGUAGE</language>
    <url>DOCUMENT_URL</url>
    <text>DOCUMENT_TEXT</text>
    <entities>
        <entity>
            <type>DETECTED_TYPE</type>
            <start>START_POS</start>
            <end>END_POS</end>
            <text>DETECTED_ENTITY</text>
            <disambiguated>
                <name>DISAMBIGUATED_ENTITY</name>
                <subType>ENTITY_SUBTYPE</subType>
                <website>WEBSITE</website>
                <geo>LATITUDE LONGITUDE</geo>
                <dbpedia>LINKED_DATA_DBPEDIA</dbpedia>
                <yago>LINKED_DATA_YAGO</yago>
                <opencyc>LINKED_DATA_OPENCYC</opencyc>
                <umbel>LINKED_DATA_UMBEL</umbel>
                <freebase>LINKED_DATA_FREEBASE</freebase>
                <ciaFactbook>LINKED_DATA_FACTBOOK</ciaFactbook>
                <census>LINKED_DATA_CENSUS</census>
                <geonames>LINKED_DATA_GEONAMES</geonames>
                <musicBrainz>LINKED_DATA_MUSICBRAINZ</musicBrainz>
                <crunchbase>CRUNCHBASE_WEB_LINK</crunchbase>
                <semanticCrunchbase>LINKED_DATA_CRUNCHBASE</semanticCrunchbase>
            </disambiguated>
            <quotations>
                <quotation>ENTITY_QUOTATION</quotation>
            </quotations>
        </entity>
    </entities>
</results>

Response Format (JSON):

{
    "status": "REQUEST_STATUS",
    "language": "DOCUMENT_LANGUAGE",
    "url": "DOCUMENT_URL",
    "text": "DOCUMENT_TEXT",
    "entities": [
        {
            "type": "DETECTED_TYPE",
            "start": "START_POS",
            "end": "END_POS",
            "text": "DETECTED_ENTITY",
            "disambiguated": {
                "name": "DISAMBIGUATED_ENTITY",
                "subType": "ENTITY_SUBTYPE",
                "website": "WEBSITE",
                "geo": "LATITUDE LONGITUDE",
                "dbpedia": "LINKED_DATA_DBPEDIA",
                "yago": "LINKED_DATA_YAGO",
                "opencyc": "LINKED_DATA_OPENCYC",
                "umbel": "LINKED_DATA_UMBEL",
                "freebase": "LINKED_DATA_FREEBASE",
                "ciaFactbook": "LINKED_DATA_FACTBOOK",
                "census": "LINKED_DATA_CENSUS",
                "geonames": "LINKED_DATA_GEONAMES",
                "musicBrainz": "LINKED_DATA_MUSICBRAINZ",
                "crunchbase": "CRUNCHBASE_WEB_LINK",
                "semanticCrunchbase": "LINKED_DATA_CRUNCHBASE"
            },
            "quotations": [
                {
                    "quotation": "ENTITY_QUOTATION"
                }
            ]
        }
    ]
}

Response Format (RDF):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
                 xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
    <rdf:Description rdf:ID="DOCUMENT_HASH">
        <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
        <aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
        <aapi:Language>DOCUMENT_LANGUAGE</aapi:Language>
        <aapi:URL>DOCUMENT_URL</aapi:URL>
        <aapi:DocText>DOCUMENT_TEXT</aapi:DocText>
    </rdf:Description>
    <rdf:Description rdf:ID="DOCUMENT_HASH-ENTITY_NUM">
        <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#EntityOccurrence"/>
        <aapi:Doc>DOCUMENT_HASH</aapi:Doc>
        <aapi:EntityType>DETECTED_TYPE</aapi:EntityType>
        <aapi:TextStartPos>START_POS</aapi:TextStartPos>
        <aapi:TextEndPos>END_POS</aapi:TextEndPos>
        <aapi:Name>DETECTED_ENTITY</aapi:Name>
        <aapi:Disambiguation>
            <rdf:Description rdf:about="#DOCUMENT_HASH-ENTITY_NUM">
                <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#Disambiguation"/>
                <aapi:Doc>DOCUMENT_HASH</aapi:Doc>
                <aapi:ResolvedName>DISAMBIGUATED_ENTITY</aapi:ResolvedName>
                <aapi:SubType>ENTITY_SUBTYPE</aapi:SubType>
                <aapi:URL>WEBSITE</aapi:URL>
                <aapi:Geo>LATITUDE LONGITUDE</aapi:Geo>
                <owl:sameAs rdf:resource="LINKED_DATA_DBPEDIA"/>
                <owl:sameAs rdf:resource="LINKED_DATA_YAGO"/>
                <owl:sameAs rdf:resource="LINKED_DATA_OPENCYC"/>
                <owl:sameAs rdf:resource="LINKED_DATA_UMBEL"/>
                <owl:sameAs rdf:resource="LINKED_DATA_FREEBASE"/>
                <owl:sameAs rdf:resource="LINKED_DATA_FACTBOOK"/>
                <owl:sameAs rdf:resource="LINKED_DATA_CENSUS"/>
                <owl:sameAs rdf:resource="LINKED_DATA_GEONAMES"/>
                <owl:sameAs rdf:resource="LINKED_DATA_MUSICBRAINZ"/>
                <owl:sameAs rdf:resource="LINKED_DATA_CRUNCHBASE"/>
            </rdf:Description>
        </aapi:Disambiguation>
        <aapi:Quotations>
            <rdf:Description rdf:about="#DOCUMENT_HASH-ENTITY_NUM">
                <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#Quotations"/>
                <aapi:Doc>DOCUMENT_HASH</aapi:Doc>
                <aapi:Quotation>ENTITY_QUOTATION</aapi:Quotation>
            </rdf:Description>
        </aapi:Quotations>
    </rdf:Description>
</rdf:RDF>

Response Fields:

field name - field description

status - success / failure status indicating whether the request was processed.
    Possible values:
        OK
        ERROR

language - detected language (english, french, ..) for the submitted content.

url - http url information was requested for.

type - the detected entity type.

start - start offset (in bytes) of this entity in the text stream.
    Only included in API responses for AlchemyAPI subscription users

end - end offset (in bytes) of this entity in the text stream.
    Only included in API responses for AlchemyAPI subscription users

text - the detected entity text.

disambiguated - disambiguation information for the detected entity (sent only if disambiguation occurred)

    disambiguation field - field description

    name - the disambiguated entity name.
    subType - the disambiguated entity subType
        SubTypes expose additional ontological mappings for a detected entity, such as identification of a Person as a Politician or Athlete.
    website - the disambiguated entity website.
    geo - latitude longitude
        the disambiguated entity geographic coordinates.
    dbpedia - sameAs link to DBpedia for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    yago - sameAs link to YAGO for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    opencyc - sameAs link to OpenCyc for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    umbel - sameAs link to UMBEL for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    freebase - sameAs link to Freebase for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    ciaFactbook - sameAs link to the CIA World Factbook for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    census - sameAs link to the US Census for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    geonames - sameAs link to Geonames for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    musicBrainz - sameAs link to MusicBrainz for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.
    crunchbase - website link to CrunchBase for the disambiguated entity.
        Note: Provided only for entities that exist in CrunchBase.
    semanticCrunchbase - sameAs link to Semantic CrunchBase for the disambiguated entity.
        Note: Provided only for entities that exist in this linked data-set.

quotations - extracted quotations for the detected entity (sent only if quotations extraction is enabled)

    field - field description

    quotation - quotation extracted for a particular named entity.

statusInfo - failure status information (sent only if "status" == "ERROR").
    Possible values:
        invalid-api-key
        page-is-not-html
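
Because the start and end fields are byte offsets rather than character offsets, slicing the document text directly can return the wrong span when the text contains multi-byte characters. The Python sketch below slices the encoded bytes instead. It rests on three assumptions that this page does not confirm: the text stream is UTF-8, the offsets refer to the returned document text, and the end offset is exclusive. The sample sentence and offsets are invented.

```python
def entity_text(document_text, start, end):
    """Return the entity span given byte offsets (assumed UTF-8, end exclusive)."""
    raw = document_text.encode("utf-8")
    # Offsets are converted with int() because the response quotes them.
    return raw[int(start):int(end)].decode("utf-8")


# "\u00eb" occupies two bytes in UTF-8, so byte and character offsets diverge.
text = "Zo\u00eb met Ada Lovelace in London."

print(entity_text(text, "9", "21"))  # slices by bytes
print(text[9:21])                    # slices by characters, off by one here
```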

API Notes:

  1. Calls to HTMLGetNamedEntities should be made using HTTP POST.
  2. HTTP POST calls should include the Content-Type header: application/x-www-form-urlencoded
  3. Posted HTML documents can be a maximum of 600 kilobytes. Larger documents will result in a "content-exceeds-size-limit" error response.
  4. Language detection is performed on the posted document before attempting named entity extraction. The document must contain at least 15 characters of text for language detection to be performed.
  5. Documents containing fewer than 15 characters of text are assumed to be English-language content.
  6. Disambiguation of detected entities is enabled by default. Disambiguation information will be included for each entity that is successfully resolved.
  7. Entity extraction is currently supported for all languages listed on the language support page. Submissions in other languages will be rejected and an error response returned.
  8. Disambiguation and quotations extraction are currently available for English-language content only. Support for other languages is in development.


API Call: HTMLGetAnnotatedNamedEntityText

Description: The HTMLGetAnnotatedNamedEntityText call annotates posted webpage text with detected named entities (people, companies, organizations, etc.). AlchemyGrid processes the posted HTML document: it extracts text from the document structure (ignoring navigation links, advertisements, and other undesirable content), performs entity extraction, and annotates the original webpage text according to the provided variable substitution template.

Endpoint: http://access.alchemyapi.com/calls/html/HTMLGetAnnotatedNamedEntityText

Parameters:

http argument - parameter description

apikey - your private api key
    (required parameter)

html - HTML document content (must be uri-argument encoded)
    (required parameter)

url - HTML document URL
    (optional parameter, must be uri-argument encoded)

template - the annotation template to apply to the text being marked up. This template is applied to each detected entity in the source text. Variable substitution is used to integrate entity information into the annotated text.
    Supported substitution variables:

        variable - variable description

        $ENTITY - the detected named entity text
        $ENCODED_ENTITY - the detected named entity text (uri-argument encoded)
        $TRIMMED_ENTITY - the detected named entity text (punctuation-trimmed)
        $ENCODED_TRIMMED_ENTITY - the detected named entity text (punctuation-trimmed, uri-argument encoded)
        $RESOLVED_ENTITY - the resolved, disambiguated named entity
        $ENCODED_RESOLVED_ENTITY - the resolved, disambiguated named entity (uri-argument encoded)
        $TYPE - the detected named entity type
    (optional parameter)

outputMode - desired API output format
    Possible values:
        xml (default)
        json
        rdf
    (optional parameter)

disambiguate - whether to disambiguate detected entities.
    Possible values:
        1 - enabled (default)
        0 - disabled
    (optional parameter)

coreference - whether to resolve he/she/etc coreferences into detected entities.
    Possible values:
        1 - enabled (default)
        0 - disabled
    (optional parameter)

quotations - whether to enable quotations extraction.
    Possible values:
        1 - enabled
        0 - disabled (default)
    (optional parameter)

showSourceText - whether to include the original 'source text' the entities were extracted from within the API response.
    Possible values:
        1 - enabled
        0 - disabled (default)
    (optional parameter)

sourceText - where to obtain the text that will be processed by this API call.
    AlchemyAPI supports multiple modes of text extraction: web page cleaning (removes ads, navigation links, etc.), raw text extraction (processes all web page text, including ads / nav links), visual constraint queries, and XPath queries.
    Possible values:
        cleaned_or_raw - cleaning enabled, fallback to raw when cleaning produces no text (default)
        cleaned - operate on 'cleaned' web page text (web page cleaning enabled)
        raw - operate on raw web page text (web page cleaning disabled)
        cquery - operate on the results of a visual constraints query
            Note: The 'cquery' http argument must also be set to a valid visual constraints query.
        xpath - operate on the results of an XPath query
            Note: The 'xpath' http argument must also be set to a valid XPath query.
    (optional parameter)

cquery - a visual constraints query to apply to the web page.
    Constraint queries enable API operations to be performed on a targeted area of a web page, such as a story title or product description.
    (optional parameter, used when sourceText is set to 'cquery'. must be uri-argument encoded)

xpath - an XPath query to apply to the web page.
    XPath queries enable API operations to be performed on a targeted area of a web page, such as a story title or product description.
    (optional parameter, used when sourceText is set to 'xpath'. must be uri-argument encoded)
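
To get a feel for what a template produces, the substitution can be emulated locally. The Python sketch below is only an approximation of the server-side behavior: the function name expand_template, the set of punctuation that is trimmed, the percent-encoding rules, and the fallback used when no resolved name is given are all assumptions, and the template itself is a made-up example.

```python
from string import Template
from urllib.parse import quote


def expand_template(template, entity, entity_type, resolved=None):
    """Locally emulate the documented $VARIABLE substitution for one entity."""
    # Assumption: "punctuation-trimmed" strips leading/trailing punctuation.
    trimmed = entity.strip(" .,;:!?\"'()")
    # Assumption: fall back to the trimmed text when nothing was resolved.
    resolved = resolved if resolved is not None else trimmed
    values = {
        "ENTITY": entity,
        "ENCODED_ENTITY": quote(entity, safe=""),
        "TRIMMED_ENTITY": trimmed,
        "ENCODED_TRIMMED_ENTITY": quote(trimmed, safe=""),
        "RESOLVED_ENTITY": resolved,
        "ENCODED_RESOLVED_ENTITY": quote(resolved, safe=""),
        "TYPE": entity_type,
    }
    # string.Template matches the longest identifier after "$", so
    # $ENCODED_ENTITY is never mistaken for $ENTITY.
    return Template(template).safe_substitute(values)


# Hypothetical template that links each entity to a tag page.
template = '<a href="http://www.example.com/tags/$ENCODED_TRIMMED_ENTITY" class="$TYPE">$ENTITY</a>'
print(expand_template(template, "Ada Lovelace,", "Person"))
```

When the template is sent to the service it must itself be uri-argument encoded, like any other argument value.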

Response Format (XML):

<results>
    <status>REQUEST_STATUS</status>
    <url>REQUESTED_URL</url>
    <language>DETECTED_LANGUAGE</language>
    <text>DOCUMENT_TEXT</text>
    <annotatedText>ANNOTATED_TEXT</annotatedText>
</results>

Response Format (JSON):

{
    "status": "REQUEST_STATUS",
    "url": "REQUESTED_URL",
    "language": "DETECTED_LANGUAGE",
    "text": "DOCUMENT_TEXT",
    "annotatedText": "ANNOTATED_TEXT"
}

Response Format (RDF):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
                 xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
    <rdf:Description rdf:ID="DOCUMENT_HASH">
        <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
        <aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
        <aapi:Language>DOCUMENT_LANGUAGE</aapi:Language>
        <aapi:URL>DOCUMENT_URL</aapi:URL>
        <aapi:DocText>DOCUMENT_TEXT</aapi:DocText>
        <aapi:AnnotatedDocText>ANNOTATED_TEXT</aapi:AnnotatedDocText>
    </rdf:Description>
</rdf:RDF>

Response Fields:

field name - field description

status - success / failure status indicating whether the request was processed.
    Possible values:
        OK
        ERROR

language - detected language (english, french, ..) that the source text was written in.

url - http url information was requested for.

annotatedText - the source text annotated with all identified named entities. Text annotation is controlled by the provided template parameter.

statusInfo - failure status information (sent only if "status" == "ERROR").
    Possible values:
        invalid-api-key
        cannot-retrieve
        page-is-not-html

API Notes:

  1. Calls to HTMLGetAnnotatedNamedEntityText should be made using HTTP POST.
  2. HTTP POST calls should include the Content-Type header: application/x-www-form-urlencoded
  3. Posted HTML documents can be a maximum of 600 kilobytes. Larger documents will result in a "content-exceeds-size-limit" error response.
  4. Language detection is performed on the posted document before attempting text annotation. The document must contain at least 15 characters of text for language detection to be performed.
  5. Documents containing fewer than 15 characters of text are assumed to be English-language content.
  6. Disambiguation of detected entities is enabled by default. Disambiguation information will be made available for variable substitution for each entity that is successfully resolved.
  7. Entity extraction is currently supported for all languages listed on the language support page. Submissions in other languages will be rejected and an error response returned.
  8. Disambiguation and quotations extraction are currently available for English-language content only. Support for other languages is in development.

