Query Language: Content Scraping

AlchemyAPI enables web pages to be mined using simple, natural language queries. Knowledge of HTML, DOM, or other web technologies is not required.

Visual Constraints

AlchemyAPI's visual constraint query engine is a powerful tool for extracting structured data from any web page. Constraints enable content to be identified using visual characteristics such as text labels & patterns, positioning within the web page, structural encapsulation, and more. Mining structured data via visual constraints is robust against changes in underlying HTML document structure, providing superior content extraction capabilities when compared to document query technologies such as XPath.


Using Text Patterns

Text patterns enable the retrieval of content via the text contained within or surrounding items in a web page. Patterns make it easy to extract information from a web page based on textual labels, key phrases, etc. The following example illustrates extraction of web page elements using a specific pattern (in this case, retrieving all web page elements matching the 'Obama' text pattern):

Text patterns may optionally be enclosed within single / double quotes to improve readability of the query. Usage of quotes is not required unless a text pattern contains query language reserved words ('inside', 'before', 'after', etc.). Patterns may be combined with other visual constraints such as counting offsets, element names, and before / after clauses to construct more complex queries.
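
The quoting rule above can be sketched as a small client-side helper. This is a minimal sketch, not part of AlchemyAPI: the function name and the RESERVED_WORDS set are assumptions, and the set holds only the three reserved words named above (the full list is not given here).

```python
import re

# Reserved words named in the text above. The documentation says "etc.",
# so this set is incomplete and is an assumption of this sketch.
RESERVED_WORDS = {"inside", "before", "after"}


def quote_pattern(pattern: str, always: bool = False) -> str:
    """Return the pattern, quoted when it contains a reserved word."""
    words = set(re.findall(r"[a-z]+", pattern.lower()))
    if not always and not (words & RESERVED_WORDS):
        return pattern  # quotes are optional here
    if "'" not in pattern:
        return f"'{pattern}'"
    if '"' not in pattern:
        return f'"{pattern}"'
    # Escaping rules are not documented here, so refuse rather than guess.
    raise ValueError("pattern contains both single and double quotes")
```

Passing always=True quotes every pattern, which may be preferable when readability matters more than brevity.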


Selecting Specific Items (Counting Offsets)

Counting offsets enable the retrieval of specific elements within a web page via their positioning. Counts make it easy to extract the 1st, 2nd, 3rd, 20th, or other element via a numeric offset. The following example illustrates extraction of web page elements using a specific counting offset (in this case, retrieving the 3rd web page element that matches the 'Obama' text pattern):

Counting offsets may be combined with other visual constraints such as element names, before / after clauses, and containing / inside clauses to construct more complex queries.
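
When queries are generated programmatically, the ordinal form of an offset (1st, 2nd, 3rd, 20th) has to be produced from an integer. The helper below is a minimal sketch of that step. The function names are hypothetical, and the word order of the resulting query (offset first, then pattern) is an assumption inferred from the description above rather than a confirmed syntax.

```python
def ordinal(n: int) -> str:
    """Format a positive integer as an English ordinal, e.g. 3 -> '3rd'."""
    if n < 1:
        raise ValueError("counting offsets start at 1")
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th, ...
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def offset_query(offset: int, pattern: str) -> str:
    """Assumed query shape: '<ordinal> <quoted pattern>'."""
    return f"{ordinal(offset)} '{pattern}'"
```

Under these assumptions, offset_query(3, "Obama") yields the string 3rd 'Obama'.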


Selecting Multiple Items (Multi Selections)

Selection counts enable the retrieval of a specific number of elements within a web page. This makes it easy to extract 1, 5, 10, 100, or any other quantity of elements via a numeric count. The following example illustrates extraction of a specific number of web page elements (in this case, retrieving '3' web page elements matching the 'Obama' text pattern):

To select all elements matching a set of constraints, a selection count may be omitted, or the "all" keyword may be utilized (Example query: all 'texas' paragraphs).

Selection counts may be combined with other visual constraints such as element names, before / after clauses, and containing / inside clauses to construct more complex queries.
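
The "all" form shown above (all 'texas' paragraphs) and the counted form can be produced by one small builder. This is a sketch only: the function name is hypothetical, and the counted form (a bare number in place of "all") is an assumption based on the descriptions in this document.

```python
from typing import Optional


def multi_select(pattern: str, element: str = "", count: Optional[int] = None) -> str:
    """Build a multi-selection query; count=None selects all matches."""
    if count is not None and count < 1:
        raise ValueError("selection count must be at least 1")
    head = "all" if count is None else str(count)
    parts = [head, f"'{pattern}'"]
    if element:
        parts.append(element)  # e.g. 'paragraphs' or 'links'
    return " ".join(parts)
```

For example, multi_select("texas", "paragraphs") reproduces the documented query all 'texas' paragraphs.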


Refining Queries Using Element Names

Element names enable the targeted retrieval of specific web page elements based on their HTML tag name, or their "common name". Element names enable queries to be refined to focus on specific items within a web page such as links, paragraphs of text, items in a list, etc. The following example illustrates extraction of a specific type of web page element (in this case, retrieving '4' web page elements matching the 'tetris' text pattern that are also LINK elements [HTML A tag]):

Common names (aliases) may be utilized to select web page elements in a natural language fashion. Supported common name mappings include:

HTML Tag    Common Name(s)
A           link
DIV         area, region
EMBED       video, movie
IMG         image, photo, picture
LI          list-item
P           paragraph, line
TD          column, element
TR          row

Element names may be combined with other visual constraints such as counting offsets, before / after clauses, and containing / inside clauses to construct more complex queries.
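
For client code that needs to know which HTML tag a common name refers to, the mapping above can be expressed as a lookup table. This is a minimal sketch: the names COMMON_NAMES and resolve_tag are hypothetical, and the handling of a trailing "s" (so that "paragraphs" resolves like "paragraph") is a convenience assumed here, not documented engine behavior.

```python
# Tag -> common names, copied from the mapping table above.
COMMON_NAMES = {
    "A": ["link"],
    "DIV": ["area", "region"],
    "EMBED": ["video", "movie"],
    "IMG": ["image", "photo", "picture"],
    "LI": ["list-item"],
    "P": ["paragraph", "line"],
    "TD": ["column", "element"],
    "TR": ["row"],
}

# Inverted index: common name -> tag.
ALIAS_TO_TAG = {
    alias: tag for tag, aliases in COMMON_NAMES.items() for alias in aliases
}


def resolve_tag(name: str) -> str:
    """Return the HTML tag for a tag name or common name."""
    key = name.strip().lower()
    if key.upper() in COMMON_NAMES:
        return key.upper()
    if key in ALIAS_TO_TAG:
        return ALIAS_TO_TAG[key]
    # Assumed convenience: accept simple plurals such as 'links'.
    if key.endswith("s") and key[:-1] in ALIAS_TO_TAG:
        return ALIAS_TO_TAG[key[:-1]]
    raise KeyError(f"unknown element name: {name!r}")
```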


Querying Inside Tables (Selecting Rows)

Rows within any table may be selected based on visual constraints such as text patterns and counting offsets. The following example illustrates extraction of a specific table row (in this case, retrieving a row matching the 'list price' text pattern):

Row selections may be combined with other visual constraints such as text patterns, counting offsets, and multi selections to construct more complex queries.


Querying Inside Tables (Selecting By Column Number)

Columns within any table may be selected based on visual constraints such as counting offsets. The following example illustrates extraction of a specific table column (in this case, retrieving the 2nd column in the table matching the 'leaderboard' text pattern):

Column selections may be combined with other visual constraints such as text patterns and multi selections to construct more complex queries.


Querying Inside Tables (Selecting By Column Name)

Columns within any table may be selected based on visual constraints such as text patterns. Text patterns enable retrieval of columns based on a specific table heading, text found within a column, etc. The following example illustrates extraction of a specific table column (in this case, retrieving the column matching the 'best score' text pattern in the table matching the 'leaderboard' text pattern):

Column selections may be combined with other visual constraints such as counting offsets and multi selections to construct more complex queries.


Querying Inside Tables (Selecting a Column Inside a Specific Row)

Columns within any specific table row may be selected based on visual constraints such as text patterns and counting offsets. The following example illustrates extraction of a column within a specific table row (in this case, retrieving the 2nd column in a table row matching the 'list price' text pattern):

'Column inside row' selections may be combined with other visual constraints such as counting offsets and multi selections to construct more complex queries.


Position Constraints: Selecting Before an Item

Selections may be restricted to specific portions of a document using the 'Before' operator. This operator limits query results to those items occurring before a specified web page element. The following example illustrates extraction of content using the 'Before' operator (in this case, retrieving links occurring before the 1st link matching the 'research' text pattern):

'Before' selections may be combined with other visual constraints such as text patterns and counting offsets to construct more complex queries.


Position Constraints: Selecting After an Item

Selections may be restricted to specific portions of a document using the 'After' operator. This operator limits query results to those items occurring after a specified web page element. The following example illustrates extraction of content using the 'After' operator (in this case, retrieving links occurring after the 1st link matching the 'crunchbase' text pattern):

'After' selections may be combined with other visual constraints such as text patterns and counting offsets to construct more complex queries.


Combining Counting Offsets + Before / After

'Before' / 'After' queries may be refined via the usage of counting offsets and multi selections to select ranges of items from within a web page. The following example illustrates range-restricted extraction of content using the 'After' operator and counting offsets (in this case, retrieving the 1st 10 links occurring after the web page element matching the 'latest news' text pattern):

Range-restricted 'Before' / 'After' selections may be combined with other visual constraints such as the containing and inside operators to construct more complex queries.


Encapsulation Constraints: Containing

Selections may be restricted to items that contain another specified item using the 'Containing' operator. The following example illustrates extraction of content using the 'Containing' operator (in this case, retrieving an area within the web page that contains an element matching the 'all entertainment' text pattern):

'Containing' selections may be combined with other visual constraints such as text patterns and counting offsets to construct more complex queries.


Encapsulation Constraints: Inside

Selections may be restricted to items occurring inside another item using the 'Inside' operator. The following example illustrates extraction of content using the 'Inside' operator (in this case, retrieving all links within the web page element matching the 'just in' text pattern):

'Inside' selections may be combined with other visual constraints such as text patterns and counting offsets to construct more complex queries.


Encapsulation Constraints: Combining Inside + Containing

Powerful web page queries may be constructed by combining the 'Inside' and 'Containing' operators. Chaining operators in this manner enables web page selections to be refined to focus on items that match specific visual and structural constraints. The following example illustrates extraction of content using the combination of the 'Inside' operator and 'Containing' operator (in this case, retrieving all links within an area in the web page that contains an element matching the 'all entertainment' text pattern):

'Inside + Containing' selections may be combined with other visual constraints such as text patterns and counting offsets to construct more complex queries.


Combining Patterns + Counts + Inside + Containing

Powerful web page queries may be constructed by combining multiple operators such as text patterns, counting offsets, inside, and containing. The following example illustrates extraction of content using a combination of text patterns, counting offsets, the 'Inside' operator, and the 'Containing' operator (in this case, retrieving the 1st 4 links after an element matching the 'video' pattern within an area in the web page that contains an element matching the 'all entertainment' text pattern):

Chaining operators in this fashion enables the easy construction of queries that employ complex visual and structural constraints.
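
Chained queries of this kind can also be assembled programmatically. The sketch below is an assumption-heavy illustration: the function name is hypothetical, and the clause order and exact wording of the generated string are guesses reconstructed from the prose description above, not a confirmed query syntax.

```python
from typing import Tuple

# Operators named in this document.
OPERATORS = ("before", "after", "inside", "containing")


def chain(selection: str, *clauses: Tuple[str, str]) -> str:
    """Append '<operator> <target>' clauses to a base selection."""
    parts = [selection]
    for operator, target in clauses:
        if operator not in OPERATORS:
            raise ValueError(f"unsupported operator: {operator!r}")
        parts.append(f"{operator} {target}")
    return " ".join(parts)


# Guessed rendering of the query described above.
query = chain(
    "1st 4 links",
    ("after", "'video'"),
    ("inside", "area"),
    ("containing", "'all entertainment'"),
)
```

If the engine rejects a generated string, the clause order is the first thing to vary.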


Extracting Keywords + Phrases From Query Results

Visual constraints may be used to perform targeted keyword / phrase extraction operations on specific portions of a web page, such as product descriptions and story titles. The following example illustrates targeted extraction of keywords / phrases using a simple visual constraint query (in this case, extracting keywords + phrases from all web page elements matching the 'Obama' text pattern):

For more information on performing keyword + phrase extraction from targeted web page elements using visual constraints, see the AlchemyAPI Keyword API documentation. Note: Both the 'sourceText' and 'cquery' parameters must be properly configured in all API calls to utilize this feature.
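
The parameter requirement in the note above can be sketched as a query-string builder. Only the parameter names 'sourceText' and 'cquery' come from this document. The 'apikey' and 'url' parameter names, the value "cquery" for 'sourceText', and the function name are assumptions, and no endpoint is shown because none is given here.

```python
from urllib.parse import urlencode


def build_cquery_params(api_key: str, page_url: str, cquery: str) -> str:
    """Encode request parameters for a visual-constraint-targeted call."""
    params = {
        "apikey": api_key,       # assumed parameter name
        "url": page_url,         # assumed parameter name
        "sourceText": "cquery",  # assumed value selecting constraint queries
        "cquery": cquery,        # the visual constraint query itself
    }
    return urlencode(params)


# Placeholder key and URL; the query targets elements matching 'Obama'.
query_string = build_cquery_params("KEY", "http://example.com/", "Obama")
```

urlencode percent-encodes quotes and spaces, so queries containing quoted multi-word patterns can be passed as-is.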


Extracting Named Entities From Query Results

Visual constraints may be used to perform targeted named entity extraction operations on specific portions of a web page, such as product descriptions and story titles. The following example illustrates targeted extraction of named entities using a simple visual constraint query (in this case, extracting named entities from all web page elements matching the 'Obama' text pattern):

For more information on performing named entity extraction from targeted web page elements using visual constraints, see the AlchemyAPI Entity API documentation. Note: Both the 'sourceText' and 'cquery' parameters must be properly configured in all API calls to utilize this feature.


Detecting Language of Query Results

Visual constraints may be used to perform targeted language detection operations on specific portions of a web page, such as product descriptions and story titles. The following example illustrates targeted language detection using a simple visual constraint query (in this case, detecting the language of the 1st 6 links found before a web page element matching the 'flash info' text pattern):

For more information on performing language detection on targeted web page elements using visual constraints, see the AlchemyAPI Language API documentation. Note: Both the 'sourceText' and 'cquery' parameters must be properly configured in all API calls to utilize this feature.


Topic Categorization of Query Results

Visual constraints may be used to perform targeted topic categorization operations on specific portions of a web page, such as story titles and descriptions. The following example illustrates targeted topic categorization using a simple visual constraint query (in this case, performing topic categorization on all the links found in an area in the web page matching the 'just in' text pattern):

For more information on performing topic categorization on targeted web page elements using visual constraints, see the AlchemyAPI Categorization API documentation. Note: Both the 'sourceText' and 'cquery' parameters must be properly configured in all API calls to utilize this feature.


My Query Isn't Working

Visual constraints are a powerful mechanism for querying web pages based on their visual and structural characteristics. AlchemyAPI's visual constraints engine employs advanced heuristics and statistical modeling approaches. Due to the heuristic nature of visual constraints and the inexact nature of constraint queries, you may end up formulating a query that appears to be valid but does not return the expected results. If this occurs, try re-wording or modifying your query. The following image illustrates a visual constraint query ("isbn-13") and various examples of how this query may be reformulated:

If your query is still not working after attempted troubleshooting, please feel free to contact the AlchemyAPI Support Team for more assistance. XPath may also be utilized as a data extraction mechanism by most AlchemyAPI calls (see the "sourceText" and "xpath" URI parameters).
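
The re-wording advice above can be automated by trying several reformulations in order and keeping the first one that returns results. This is a minimal sketch: the function name is hypothetical, run_query stands for whatever caller-supplied function submits a query and returns the matched items, and the candidate strings are illustrative guesses rather than documented reformulations.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def first_working_query(
    candidates: Iterable[str],
    run_query: Callable[[str], List[str]],
) -> Optional[Tuple[str, List[str]]]:
    """Return the first candidate that yields results, with those results."""
    for query in candidates:
        results = run_query(query)
        if results:
            return query, results
    return None  # no reformulation produced results


# Hypothetical reformulations of the "isbn-13" query.
CANDIDATES = ["isbn-13", "'isbn-13'", "isbn"]
```

Ordering the candidates from most to least specific keeps the most precise working query.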