Content Scraping: Web API
AlchemyAPI provides easy-to-use facilities for extracting structured data from your publicly accessible web content. These URL processing calls automatically fetch the requested web page and extract the requested structured data.
These API calls can be used to process hosted web pages, blogs, and other publicly accessible Internet content. If you are processing content that is not hosted on a public web server, use our HTML API calls instead.
API Call: URLGetConstraintQuery
Description: The URLGetConstraintQuery call extracts structured data from a given web page. AlchemyAPI downloads the requested URL, analyzes the HTML document structure, and performs structured data extraction operations.
Endpoint: http://access.alchemyapi.com/calls/url/URLGetConstraintQuery
Parameters:
| http argument | parameter description |
|---|---|
| url | http url (must be uri-argument encoded) (required parameter) |
| apikey | your private api key (required parameter) |
| cquery | document constraint query (required parameter) |
| outputMode | desired API output format. Possible values: xml (default), json, rdf (optional parameter) |
| jsonp | desired JSONP callback (optional parameter, requires "outputMode" to be set to json) |
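As an illustration of how the parameters above combine into a GET request, here is a minimal Python sketch. The helper name `build_request_url` is hypothetical and not part of AlchemyAPI, and the `cquery` value shown is only a placeholder, since the constraint query syntax is not described in this section.

```python
from urllib.parse import urlencode

# Endpoint as listed in this document.
ENDPOINT = "http://access.alchemyapi.com/calls/url/URLGetConstraintQuery"


def build_request_url(apikey, url, cquery, output_mode="xml", jsonp=None):
    """Assemble a GET URL from the documented parameters (hypothetical helper)."""
    if output_mode not in ("xml", "json", "rdf"):
        raise ValueError("outputMode must be xml, json, or rdf")
    if jsonp is not None and output_mode != "json":
        # The parameter table states that jsonp requires outputMode=json.
        raise ValueError("jsonp requires outputMode to be json")
    params = {
        "apikey": apikey,
        "url": url,        # urlencode() percent-encodes this value
        "cquery": cquery,
        "outputMode": output_mode,
    }
    if jsonp is not None:
        params["jsonp"] = jsonp
    return ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder values; substitute your own key and constraint query.
    print(build_request_url("YOUR_API_KEY", "http://www.example.com/",
                            "YOUR_CONSTRAINT_QUERY", output_mode="json"))
```

Passing the target page through `urlencode` takes care of the "uri-argument encoded" requirement, so the page's own query string cannot be confused with the API's parameters.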
Response Format (XML):
<results>
<status>REQUEST_STATUS</status>
<url>REQUESTED_URL</url>
<queryResults>
<queryResult>
<resultText>DETECTED_TEXT</resultText>
<resultURL>DETECTED_URL</resultURL>
</queryResult>
</queryResults>
</results>
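A response in the XML layout above could be read along the following lines. This is a sketch that assumes the element names are exactly those shown in the template; the function name `parse_xml_response` and the sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_xml_response(xml_text):
    """Turn an XML response into a plain dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    results = []
    # Each <queryResult> carries one extracted text/URL pair.
    for node in root.findall("./queryResults/queryResult"):
        results.append({
            "resultText": node.findtext("resultText"),
            "resultURL": node.findtext("resultURL"),
        })
    return {
        "status": root.findtext("status"),
        "url": root.findtext("url"),
        "queryResults": results,
    }


# Sample document following the template above, with invented values.
SAMPLE_XML = """<results>
<status>OK</status>
<url>http://www.example.com/</url>
<queryResults>
<queryResult>
<resultText>Example text</resultText>
<resultURL>http://www.example.com/page</resultURL>
</queryResult>
</queryResults>
</results>"""

if __name__ == "__main__":
    print(parse_xml_response(SAMPLE_XML))
```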
Response Format (JSON):
{
"status": "REQUEST_STATUS",
"url": "REQUESTED_URL",
"queryResults": [
{
"resultText": "DETECTED_TEXT",
"resultURL": "DETECTED_URL"
}
]
}
Response Format (RDF):
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
<rdf:Description rdf:ID="DOCUMENT_HASH">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
<aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
<aapi:URL>DOCUMENT_URL</aapi:URL>
<aapi:CQueryResultText>DETECTED_TEXT</aapi:CQueryResultText>
<aapi:CQueryResultURL>DETECTED_URL</aapi:CQueryResultURL>
</rdf:Description>
</rdf:RDF>
Response Fields:
| field name | field description |
|---|---|
| status | success / failure status indicating whether the request was processed. Possible values: OK, ERROR |
| url | http url information was requested for. |
| resultText | some extracted text (structured data). |
| resultURL | an extracted URL (structured data). |
| statusInfo | failure status information (sent only if "status" == "ERROR"). Possible values: invalid-api-key, cannot-retrieve, page-is-not-html |
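The fields above suggest a simple handling pattern for JSON output: check `status` first, surface `statusInfo` on failure, and otherwise collect the result pairs. The sketch below assumes the JSON layout shown earlier; `AlchemyAPIError` and `extract_results` are hypothetical names, not part of any official client.

```python
import json


class AlchemyAPIError(Exception):
    """Raised when the response reports status ERROR (hypothetical class)."""


def extract_results(json_text):
    """Return (resultText, resultURL) pairs, or raise on an ERROR status."""
    data = json.loads(json_text)
    if data.get("status") != "OK":
        # statusInfo is documented as sent only when status is ERROR.
        raise AlchemyAPIError(data.get("statusInfo", "unknown error"))
    return [
        (item.get("resultText"), item.get("resultURL"))
        for item in data.get("queryResults", [])
    ]


if __name__ == "__main__":
    ok = '{"status": "OK", "url": "http://www.example.com/", "queryResults": []}'
    print(extract_results(ok))
```

Using `.get()` for `resultText` and `resultURL` is a defensive choice in this sketch: a missing field yields `None` rather than an exception.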
Example Calls:
XML: http://access.alchemyapi.com/calls/...
RDF: http://access.alchemyapi.com/calls/...
API Notes:
1. Calls to URLGetConstraintQuery can be made using HTTP GET or POST.
2. HTTP POST calls should include the Content-Type header: application/x-www-form-urlencoded
3. URL retrieval is attempted for a maximum of 10 seconds. Requests taking longer than this will result in a "cannot-retrieve" error response.
4. Requested HTML documents can be a maximum of 600 kilobytes. Larger documents will result in a "content-exceeds-size-limit" error response.
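The POST variant described in the notes above might be prepared as follows. This sketch only uses Python's standard library; `build_post_request` and `send` are hypothetical helpers, and the 15-second client timeout is an assumption chosen to exceed the 10-second retrieval window, not a documented value.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint as listed in this document.
ENDPOINT = "http://access.alchemyapi.com/calls/url/URLGetConstraintQuery"


def build_post_request(apikey, url, cquery, output_mode="json"):
    """Prepare a form-encoded POST request without sending it."""
    body = urlencode({
        "apikey": apikey,
        "url": url,
        "cquery": cquery,
        "outputMode": output_mode,
    }).encode("ascii")
    return Request(
        ENDPOINT,
        data=body,
        # Header required for POST calls, per the API notes.
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def send(request, timeout=15.0):
    """Send the request and return the body as text (timeout is an assumption)."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Separating request construction from sending keeps the first step easy to inspect or log before any network traffic occurs.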
