Fetching Large Amounts of Data

Scenario – The user wants to retrieve a large amount of historical data in the most efficient and polite manner.

Background – The general strategy is to set max_matches to the highest value allowed for your API key. When the number of results in <TotalFound> exceeds this limit, break the query up by using consecutive date ranges. Design the querying logic to generate multiple result sets that don't exceed the Max Matches limit. For each result set use limit/offset to iteratively fetch those matches.

Examples:

  • Find the number of results from the past 3 months containing the terms 'android' and 'mobile':

    api.boardreader.com/v1/Boards/Search?query=android%20mobile&key=YOURKEYHERE&filter_language=en&limit=1

    The <TotalFound> value in the API response is 76,733, and the user wishes to download all the matches for further analysis.

  • Assume the Max Matches limit for the user is configured to be 10k. A valid strategy would be to run a series of weekly queries that span the three-month period. Each query would likely return fewer than 10k results, and the matches could be retrieved by iterating through each result set using the limit and offset parameters.
    We can check this:

    api.boardreader.com/v1/Boards/Search?query=android%20mobile&key=YOURKEYHERE&filter_language=en&limit=100&group_by=date&group_date=week

    … the results confirm that the weekly counts over the past 3 months never exceed 10k.

  • Now get the matches for the first week of the desired date range (e.g. Mon, 06 Jun 2016 00:00:00 GMT to Sun, 12 Jun 2016 23:59:59 GMT):

    api.boardreader.com/v1/Boards/Search?query=android%20mobile&key=YOURKEYHERE&filter_language=en&filter_date_from=1465171200&filter_date_to=1465775999&max_matches=10000&limit=100&offset=0

    <TotalFound>7364</TotalFound>

  • After the first week's matches are fetched, the filter_date_from and filter_date_to parameters can be advanced by one week (Mon, 13 Jun 2016 00:00:00 GMT to Sun, 19 Jun 2016 23:59:59 GMT) and the process repeated until all the matches for the past three months have been downloaded:

    api.boardreader.com/v1/Boards/Search?query=android%20mobile&key=YOURKEYHERE&filter_language=en&filter_date_from=1465776000&filter_date_to=1466380799&max_matches=10000&limit=100&offset=0

    <TotalFound>7950</TotalFound>
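
The weekly-window strategy above can be sketched in Python as follows. This is a minimal sketch, not an official client: weekly_windows, page_offsets and build_url are hypothetical helper names, the https:// scheme is an assumption, and no HTTP request is made. Only parameters that already appear in the example URLs are used.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote, urlencode

# Assumption: the endpoint is served over https.
BASE_URL = "https://api.boardreader.com/v1/Boards/Search"


def weekly_windows(start, end):
    """Yield (filter_date_from, filter_date_to) epoch pairs, one per week."""
    week = timedelta(weeks=1)
    while start < end:
        stop = min(start + week, end)
        # The examples end each window at 23:59:59, so stop one second early.
        yield int(start.timestamp()), int(stop.timestamp()) - 1
        start = stop


def page_offsets(total_found, limit=100, max_matches=10000):
    """Return the offsets needed to page through one result set."""
    reachable = min(total_found, max_matches)
    return list(range(0, reachable, limit))


def build_url(query, key, date_from, date_to, offset, limit=100, max_matches=10000):
    """Build one search URL for a given date window and offset."""
    params = {
        "query": query,
        "key": key,
        "filter_language": "en",
        "filter_date_from": date_from,
        "filter_date_to": date_to,
        "max_matches": max_matches,
        "limit": limit,
        "offset": offset,
    }
    # quote_via=quote encodes spaces as %20, as in the examples above.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


# Two weeks starting Mon, 06 Jun 2016 00:00:00 GMT.
start = datetime(2016, 6, 6, tzinfo=timezone.utc)
end = datetime(2016, 6, 20, tzinfo=timezone.utc)
for date_from, date_to in weekly_windows(start, end):
    print(build_url("android mobile", "YOURKEYHERE", date_from, date_to, offset=0))
```

In real use, the <TotalFound> value from the first response for each window would be passed to page_offsets to decide how many further requests that window needs. If a window ever reports more results than the Max Matches limit, it would have to be split into shorter date ranges.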


Description of max_matches caching algorithm

  • The cache lifetime is 5 minutes; i.e. when a user submits a request like this:

    api.boardreader.com/v1/Boards/Search?query=google&key=YOURKEYHERE&max_matches=10000&mode=full

    ... the whole set (up to 10000 documents in this case) is cached and the user has 5 minutes to work with it using any limit/offset values. After that the cache may be purged.
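
As a rough illustration of what the 5-minute lifetime means for paging, the arithmetic below estimates the average time available per request if the whole cached set is to be read before the cache may be purged. The helper name and the idea of an even per-request budget are assumptions made for illustration; the only fact taken from the description above is the 5-minute lifetime.

```python
import math

CACHE_LIFETIME_SECONDS = 5 * 60  # cache lifetime stated above


def seconds_per_request(max_matches, limit):
    """Average time available per page when reading the whole cached set."""
    pages = math.ceil(max_matches / limit)
    return CACHE_LIFETIME_SECONDS / pages


# 10000 matches at limit=100 is 100 pages.
print(seconds_per_request(10000, 100))  # 3.0
```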


If you're submitting queries only to get counts and/or don't intend to retrieve all the possible matches, there's no need to use high Max Matches values. A high Max Matches value increases per-query RAM usage and, when not required, wastes system resources. In such cases you can omit max_matches from your request and the API will use the default of 1000.

How can I implement keyword monitoring using the APIs?

Scenario - Monitor Boardreader on an ongoing basis for new results that match a static keyword or list of keywords.

Background - The aggregation of message board data is primarily done via traditional web crawling. Message Board technologies and platforms have not yet evolved to the point where they can be considered part of the real-time web, so users should not expect a real-time search experience like the one on Twitter.com, for example. Most Message Boards are crawled daily, some multiple times per day, and a small minority every few days or even less often. Users should be realistic and understand that the latency in indexing new message board posts is greater than for other content types (e.g. blogs and news).

Strategies:

  • Determine what a realistic monitoring frequency is. For example, hourly may be realistic, but every 10 seconds is not.

  • Utilize the filter_inserted_from and filter_inserted_to parameters to constrain results to a time slice that matches the desired monitoring frequency.

  • Determine if newly indexed posts with older publication dates are wanted in search results. This can occur when a new Message Board is added to the coverage and a historical crawl gathers all existing posts. If older posts are not wanted, utilize the filter_date_from parameter in the recurring query.

  • Adjust timestamp values to account for the lag between when a post is inserted in the BoardReader database and when it becomes available in the search index. The incremental index is normally refreshed every 5 minutes, but the exact amount of time needed to complete a refresh is not predictable. Therefore, to avoid missing data, offset timestamp values adequately to ensure that data inserted during the time slice being queried has been included in the index.

Example:

Find all newly collected Message Board posts that mention the term 'iphone'. Ignore posts with publication dates older than 7 days.

We will plan to rerun this query on an hourly basis and use a one hour offset (from the date/time when the query is actually run). The example below assumes that local time is EDT and the API request will be sent at 1PM. The following values will return matches inserted between 1 and 2 hours ago.

filter_inserted_from=1342018800 (Wed, 11 Jul 2012 15:00:00 UTC)
filter_inserted_to=1342022400 (Wed, 11 Jul 2012 16:00:00 UTC)
filter_date_from=1341414000 (Wed, 4 Jul 2012 15:00:00 UTC)

api.boardreader.com/v1/Boards/Search?query=iphone&body=full_text&match_mode=boolean&key=YOURKEYHERE&offset=0&limit=100&mode=full&highlight=0&filter_inserted_from=1342018800&filter_inserted_to=1342022400&filter_date_from=1341414000
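
The three timestamps in this example can be derived from the time the query is run. The following Python is a minimal sketch under the assumptions stated in the example (hourly slice, one-hour lag, 7-day publication cutoff); monitoring_window and its parameter names are hypothetical, not part of the API.

```python
from datetime import datetime, timedelta, timezone


def monitoring_window(run_time, slice_hours=1, lag_hours=1, max_age_days=7):
    """Return (filter_inserted_from, filter_inserted_to, filter_date_from)."""
    # Stay lag_hours behind the run time so the index has time to refresh.
    inserted_to = run_time - timedelta(hours=lag_hours)
    inserted_from = inserted_to - timedelta(hours=slice_hours)
    # Ignore posts published more than max_age_days before the slice starts.
    date_from = inserted_from - timedelta(days=max_age_days)
    return tuple(int(t.timestamp()) for t in (inserted_from, inserted_to, date_from))


# 1PM EDT on 11 Jul 2012 is 17:00 UTC.
run_time = datetime(2012, 7, 11, 17, 0, tzinfo=timezone.utc)
print(monitoring_window(run_time))  # (1342018800, 1342022400, 1341414000)
```

Working in UTC throughout avoids having to account for the local time zone or daylight saving changes when the query is rerun each hour.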

Stop Words

You can't search for a stop word.

If a stop word is used in a phrase like "I love Disney":

The stop word itself is "dropped", but the index treats stop words as generic placeholders. So we know there was some token prior to "love disney", but we don't know what that token was. In effect, the API returns matches containing "{anytoken} love disney". With this query {anytoken} is often the intended token "I", but not always. It's important to remember {anytoken} can literally be any token, including a stop word.

Changing match_mode to phrase won't help. You can't search for stop words because they aren't included in the index, regardless of match_mode.

  • example phrase with stop word... "loved the hotel":

    api.boardreader.com/v1/Reviews/Search?match_mode=extended&filter_country=us&query=%22loved%20the%20hotel%22&rt=xml&highlight=0&max_matches=200&sort_mode=time&filter_site_key=0&body=full_text&filter_date_from=1&key=YOURKEYHERE&limit=1&mode=full&filter_language=en&offset=0

    <TotalFound>103290</TotalFound>

  • example phrase with stop word removed... "loved hotel":

    api.boardreader.com/v1/Reviews/Search?match_mode=extended&filter_country=us&query=%22loved%20hotel%22&rt=xml&highlight=0&max_matches=200&sort_mode=time&filter_site_key=0&body=full_text&filter_date_from=1&key=YOURKEYHERE&limit=1&mode=full&filter_language=en&offset=0

    <TotalFound>2288</TotalFound> # a different result set than above, so the stop word is not simply ignored: its position in the phrase still matters

  • example with different stop word in same position... "loved a hotel":

    api.boardreader.com/v1/Reviews/Search?match_mode=extended&filter_country=us&query=%22loved%20a%20hotel%22&rt=xml&highlight=0&max_matches=200&sort_mode=time&filter_site_key=0&body=full_text&filter_date_from=1&key=YOURKEYHERE&limit=1&mode=full&filter_language=en&offset=0

    <TotalFound>103290</TotalFound> # illustrates the stop word was just treated as any token

  • example with two stop words... "loved a my hotel":

    api.boardreader.com/v1/Reviews/Search?match_mode=extended&filter_country=us&query=%22loved%20a%20my%20hotel%22&rt=xml&highlight=0&max_matches=200&sort_mode=time&filter_site_key=0&body=full_text&filter_date_from=1&key=YOURKEYHERE&limit=1&mode=full&filter_language=en&offset=0

    <TotalFound>17103</TotalFound> # returns matches like <Title>Loved that the hotel entrance</Title>
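
The placeholder behavior shown in these examples can be modeled with a few lines of Python. This is only an illustrative model, not the search index's actual implementation: the tokenizer (lowercase runs of letters and digits) and the function names are assumptions, and the stop words are the ones listed in this document.

```python
import re

# Stop words as listed in this document.
STOP_WORDS = {
    "the", "to", "i", "and", "a", "this", "you", "of", "it", "in", "that",
    "is", "s", "my", "t", "on", "for", "re", "but", "be", "was", "so",
}
ANY = "{anytoken}"


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def effective_phrase(phrase):
    """Replace every stop word in the phrase with an any-token placeholder."""
    return [ANY if token in STOP_WORDS else token for token in tokenize(phrase)]


def phrase_matches(phrase, text):
    """True if the text contains the phrase, with placeholders matching any token."""
    pattern = effective_phrase(phrase)
    tokens = tokenize(text)
    for i in range(len(tokens) - len(pattern) + 1):
        window = tokens[i:i + len(pattern)]
        if all(p == ANY or p == t for p, t in zip(pattern, window)):
            return True
    return False


print(effective_phrase("loved the hotel"))  # ['loved', '{anytoken}', 'hotel']
print(effective_phrase("loved a hotel"))    # ['loved', '{anytoken}', 'hotel']
print(phrase_matches("loved a my hotel", "Loved that the hotel entrance"))  # True
print(phrase_matches("loved hotel", "Loved that the hotel entrance"))       # False
```

In this model "loved the hotel" and "loved a hotel" reduce to the same pattern, which is consistent with the identical <TotalFound> values above, while "loved hotel" requires the two words to be adjacent.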

Stop word list

The following terms are treated as stop words and are not included in the BoardReader search index:

  • the

  • to

  • i

  • and

  • a

  • this

  • you

  • of

  • it

  • in

  • that

  • is

  • s

  • my

  • t

  • on

  • for

  • re

  • but

  • be

  • was

  • so