
Google is now using its RankBrain machine learning system to process every query that the search engine handles, and the system is changing the rankings of lots of queries.

The news emerged this week as part of Steven Levy’s Backchannel story about machine learning efforts at Google. From the story, in regard to RankBrain:

Google is characteristically fuzzy on exactly how it improves search (something to do with the long tail? Better interpretation of ambiguous requests?) but Dean says that RankBrain is “involved in every query,” and affects the actual rankings “probably not in every query but in a lot of queries.”

What’s more, it’s hugely effective. Of the hundreds of “signals” Google search uses when it calculates its rankings (a signal might be the user’s geographical location, or whether the headline on a page matches the text in the query), RankBrain is now rated as the third most useful.

We've already heard that RankBrain is considered the third most useful search ranking signal, behind content and links. But prior to this, Google had only said publicly, last October, that RankBrain was used to process a "large fraction" of the 15 percent of searches it had never seen before.

In short: Google’s clearly become so confident in RankBrain’s mysterious capabilities that it’s now used to help with every query that the search engine handles, more than two trillion per year.

What RankBrain may be doing

That doesn't mean RankBrain actually changes the rankings of search results for all those queries; as stated, it does so only for "a lot" of them. How can that be?

That fits in with what we’ve understood about RankBrain: it seems largely used as a query refinement tool. Google seems to be using it now for every search to better understand what that search is about. After that, another aspect of RankBrain might influence what results actually appear and in what order, but not always.

Imagine that RankBrain sees a search for “best flower shop in Los Angeles.” It might understand that this is similar to another search that’s perhaps more popular, such as “best LA flower shops.” If so, it might then simply translate the first search behind the scenes into the second one. It would do that because for a more popular search, Google has much more user data that helps it feel more confident about the quality of the results.

In the end, RankBrain did change the ranking of those results. But it did that simply because it triggered a different search, not because it used some special ranking factor to influence which exact listing appeared in what order.

That said, Google has confirmed that RankBrain is also used as an actual ranking signal, repeating that yesterday at our SMX Advanced show.

For SEO and search marketers worried about what they should do now that RankBrain has ramped up, the answer remains the same: nothing new; just keep focusing on great content. Even people at Google don't quite understand how RankBrain does what it does, we've been told. Honest. But it's ultimately designed to reward great content. So focus on that, as has always been the case with SEO, and you're on the right track.

Source:  http://searchengineland.com/google-loves-rankbrain-uses-for-every-search-252526


Jordan Koene is an SEJ Summit veteran, having spoken at a few of our search marketing conferences last year. This year, we're happy to have him at SEJ Summit Chicago, speaking on how to improve search visibility.

Jordan’s insights below are always enlightening and cover everything from moving past a plateau to how e-commerce SEO is different from other channels.

Your SEJ Summit presentation is titled Surviving the Search Plateau: 3 Tactics to Bring Your Website’s SEO Visibility to New Heights. How do you determine if you are in an SEO plateau? What signs would you look for?

You've plateaued if you reach a period where, despite your efforts, you've been unable to effect positive change on your site; for most businesses, that's measured quarterly. It usually presents itself either as slowdowns in site traffic or declines in conversion rates. Traffic is the more obvious metric, since most SEO teams are measured by it, but there are times you may see an increase in clicks that isn't reflected in your total conversions. That bears investigating.

One of the examples you give for breaking free from the plateau is by igniting your content. Does that mean blending content marketing into your SEO strategy?

That can be a piece of it, though it can take a lot of time and money. From a search perspective, the low-hanging fruit is simply to refresh the content you already have with new material or minor changes. Like layering a cake, you can build on top of your old content with structured data or additional information to create something interesting and new. Minor changes can bring big rewards.

I did a little bit of stalking and saw you are interested in wearable technology. What is your favorite wearable piece of tech—either already on the market or coming soon?

Personally, I'm really interested in the Internet of Things – items within the home, like Nest or Ring, that are beginning to talk to each other and to you. Similar to how, 3-4 years ago, fitness wearables like Fitbit started providing us with data about our health and well-being to aid self-improvement, we're now starting to see that same thought process transition into devices for the home, helping make utilitarian improvements to the way we live. These kinds of futuristic gadgets can solve a lot of problems for our world, like reducing consumption of fossil fuels and other things that have a direct impact on our environment.

You have a background in e-commerce, having worked for eBay in the past. How does SEO differ for big e-commerce brands versus, say, a service-based brand?

E-commerce has a mentality of short-term gains: everything is about making short-term progress in a competitive ecosystem, especially here in the US. For that reason, a good deal of the decision making is relatively short-sighted, and you might not see e-commerce brands invest in long-term plays the way a news or media outlet would. Service-based companies are more focused on having an online-to-offline presence, since they essentially evolved from the big directory business.

A lot of service companies are moving into a transactional service model to marry in e-commerce behaviors, like Yelp, which now offers a bidding service for consumers looking to nail down a service for a particular price. In that way, they’re becoming more similar as more companies adopt that model.

Bonus Question: What was the last book you read?

I'm currently starting Shoe Dog by Phil Knight. I've been interested in what selling was like in an era when e-commerce didn't exist, and was looking for parallels to how shopping is changing today. People like Phil Knight are pioneers who broke down lots of barriers in the market to rise to success, but it's interesting to dig into how much of his success was based on societal changes at the time – and how societal changes today might reflect market changes to come.


A prospective client had something to hide when she claimed no previous involvement in an industry rife with fraud. That claim, made alongside the submission of a well-informed business plan, rang false. Other clues about her integrity worried the lawyer, and he soon suspected that she was being dishonest. After the meeting, he consulted another partner, who in turn delivered the puzzle to my e-mail inbox. My mission was to fit the mismatched pieces of information together, either substantiating or disproving the lawyer's skepticism.

Internet Archive to the Rescue

Wanting to emphasize the importance of retaining knowledge of history, George Santayana wrote the words made famous by the film, Rise and Fall of the Third Reich--"Those who cannot remember the past are condemned to repeat it." Of course, at the time the Internet Archive didn't exist; nor did the Information Age. If it had, perhaps he would have edited his philosophy to state, "Those who cannot discover the past are condemned to repeat it."

Certainly in times when new information amounts to five exabytes, or the equivalent of "information contained in half a million new libraries the size of the Library of Congress print collections" (How Much Information? 2003), it is perhaps fortunate that librarians possess a knack for discovering information. It is also in our favor that Brewster Kahle and Alexa Internet foresaw a need for an archive of Web sites.

Internet Archive and the Wayback Machine

Founded in 1996, the Internet Archive contains about 30 billion archived Web pages. While always open to researchers, the collection did not become readily accessible until the introduction of the Wayback Machine in 2001. The Wayback Machine enables finding archived pages by their Web address. Enter a URL to retrieve a dated listing of archived versions. You can then display the archived document as well as any archived pages linked from it.

The Internet Archive helped me successfully respond to the concerns the lawyers had about the prospective client. It contained evidence of a business relationship with a company clearly in the suspect industry. Broadening the investigation to include the newly discovered company led to information about an active criminal investigation.

Suddenly, the pieces of the puzzle came together and spelled L-I-A-R.

Using the Internet Archive should be a consideration for any research project that involves due diligence, or the careful investigation of someone or something to satisfy an obligation. In addition to people and company investigations, it can assist in patent research for evidence of prior art, or in copyright or trademark research for evidence of infringement. It can also come in handy when researching events in history, looking for copies of older documents like superseded statutes or regulations, or seeking the ideals of a former political administration.

(Note, 25 October 2004: A special keyword search engine, called Recall Search, facilitates some of these queries. Unfortunately, it was removed from the site in mid-September. Messages posted in the Internet Archive forum indicate they plan to bring it back.)

(Note, 15 June 2007: I think it's safe to assume that Recall Search is not coming back. However, check out the site for developments in searching archived audio (music), video (movies) and text (books).)

Recall Search at the Internet Archive

But while the Internet Archive contains information useful in investigative research, finding what you want within the massive collection presents a challenge. If you know the exact URL of the document, or if you want to examine the contents of a specific Web site--as was the case in the scenario involving the prospective client--then the Wayback Machine will suffice. But searching the Internet Archive by keyword was not an option until recently. (Note: See the note in the previous paragraph.)

During September 2003, the project introduced Recall Search, a beta version of a keyword search feature. Recall makes about one-third of the archived collection, or roughly 11 billion Web pages, accessible by keyword. While it further facilitates finding information in the Internet Archive, it does not replace the Wayback Machine. Because of the limited size of the keyword-indexed collection and the problems inherent in keyword searching, due diligence researchers should use both finding tools.

Recall does not support Boolean operators. Instead, enter one or more keywords (fewer is probably better) and, if desired, limit the results by date.

Results appear with a graph that illustrates the frequency of the search terms over time. It also provides clues about their context. For example, a search for my name limited to Web pages collected between January 2002 and May 2003 finds ties to the concepts, "school of law," "government resources," "research site," "research librarian," "legal professionals" and "legal research." The resulting graph further shows peaks at the beginning of 2002 and in the spring of 2003.

Applying content-based relevancy ranking, Recall also generates topics and categories. Little information exists about how this feature works, and I have experienced mixed results. But the idea is to limit results by selecting a topic or category relevant to the issue.

Suppose you enter the keyword, Microsoft. The right side of the search results page suggests concepts for narrowing the query. For example, it asks if instead you mean Microsoft Windows, Microsoft Internet Explorer, Microsoft Word, and so on. Likewise, a search for turkey suggests wild turkey, the country of Turkey, turkey hunting, roast turkey and other interpretations.

While content-based relevancy ranking can be a useful algorithm, it is far from perfect. Some topics and categories generated might not seem to make sense. If the queries you run do not produce satisfactory results, consider another approach.

Pinpoint the specific sites you want to investigate by first conducting the research on the Web. In the prospective client example, an old issue of the newsletter of the company under criminal investigation (Company A) mentioned the prospective client's company (Company B). This clue led us to Company A's Web site where we found no further mention of Company B. However, with the Web site address in hand, we reviewed almost every archived page at the Internet Archive and found solid evidence of a past relationship. Additional research, during which we tracked down court records and spoke to one of the investigators, provided the verification we needed to confront the prospective client.

Advanced Search Techniques

You can display all versions of a specific page or Web site during a certain time period by modifying the URL. Greg Notess first illustrated this strategy in his On The Net column (See "The Wayback Machine: The Web's Archive," Online, March/April 2002).

A request for all archived versions of a page looks like this:
http://web.archive.org/web/*/http://www.domain.com

The asterisk is a wildcard that you can modify. For example, to find all versions from the year 2002, you would enter:
http://web.archive.org/web/2002*/http://www.domain.com
Or to find all versions from September 2002, you would enter:
http://web.archive.org/web/200209*/http://www.domain.com
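
If you build these URLs often, it is straightforward to assemble them programmatically. The short Python sketch below merely generates the wildcard patterns shown above; the function name is made up for illustration, and domain.com is the same placeholder used in the examples.

def wayback_listing_url(target_url, year=None, month=None):
    """Build a Wayback Machine listing URL for every archived version of
    target_url, optionally narrowed to a year, or to a year and month."""
    timestamp = "*"                              # all archived versions
    if year is not None:
        timestamp = f"{year}"
        if month is not None:
            timestamp += f"{month:02d}"          # e.g. 200209 for September 2002
        timestamp += "*"
    return f"http://web.archive.org/web/{timestamp}/{target_url}"

print(wayback_listing_url("http://www.domain.com"))           # .../web/*/...
print(wayback_listing_url("http://www.domain.com", 2002))     # .../web/2002*/...
print(wayback_listing_url("http://www.domain.com", 2002, 9))  # .../web/200209*/...
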
Sometimes you encounter problems when you browse pages in the archive. For example, I often receive a "failed connection" error message. This may be the result of busy Web servers or a problem with the page. It may also occur if the live Web site prohibits crawlers.

To find out if the latter issue is the problem, check the site's robot exclusion file. A standard honored by most search engines, the robot exclusion file resides in the root-level directory. To find it, enter the main URL in your browser address line followed by robots.txt (for example, http://www.domain.com/robots.txt).

If the site blocks the Internet Archive's crawler, the file will contain two lines of text similar to the following:
User-agent: ia_archiver
Disallow: /

If it forbids all crawlers, the commands should look like this:
User-agent: *
Disallow: /
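
If you would rather automate this check, Python's standard library includes a robots.txt parser. The sketch below is a minimal example; domain.com is again a placeholder, and the function name is illustrative.

from urllib import robotparser

def archive_crawler_allowed(site):
    """Return True if the site's robots.txt permits the ia_archiver user agent."""
    rp = robotparser.RobotFileParser()
    rp.set_url(site.rstrip("/") + "/robots.txt")
    rp.read()                                  # fetch and parse robots.txt
    return rp.can_fetch("ia_archiver", site)

print(archive_crawler_allowed("http://www.domain.com/"))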

It's common for Web sites to block crawlers, including the Internet Archive, from indexing their copyrighted images and other non-text files. If the Internet Archive blots out images with gray boxes, then the Web site probably prevents it from making the graphics available.

If the site does not appear to block the Internet Archive, don't give up when you encounter a "failed connection" message. Return to the Wayback Machine and enter the Web page address. This strategy generates a list of archived versions of the page whereas Recall presents specific matches to a query. One of the other dated copies of the page may load without problems.

Conclusion

While the Internet Archive does not contain a complete archive of the Web, it offers a significant collection that due diligence researchers should not overlook. Tools like the Wayback Machine and Recall Search provide points of access. However, these utilities only handle simple queries: you can search by Web page address or keyword, but you cannot conduct Boolean searches or apply more precise limits. Moreover, Recall Search provides keyword access to only one-third of the collection. Consequently, first conduct what research you can elsewhere, using public Web search engines and commercial sources, then use the information you discover to scour relevant sites in the Internet Archive.

Source:  http://virtualchase.justia.com/content/internet-archive-and-search-integrity


A web browser is software used to access the internet. It is the link between the user and the internet. A browser fetches information from the internet, with the help of the URL we provide, and displays it to the user. Most people are familiar with the "Big Five" of the browser industry: Chrome, Firefox, Internet Explorer, Safari and Opera. Although each of them has its pros and cons, some 95 percent of users stick to one of these five. One may wonder what makes them so popular when there are many other options available on the market. It is perhaps the ease of use, the ready availability, and the fact that most of them are customizable with various add-ons and extensions that have made them so popular among internet users.

People choose the browser that best suits them based on many factors: overall experience, compatibility with most websites, speed and customizability. The major browsers each differ slightly in these respects. If speed and fast search results are your main concern, Chrome is your go-to browser; the fastest browser yet, it continues to improve its speed with every version, though it may lag in convenience and customizability. The latest version of Internet Explorer, compatible with Windows PCs, is high on customizability; however, it has lost most of its market share to faster browsers. Firefox is not far behind on either count, speed or customizability: it has a quick-search option as well as a search bar that can be tailored to your preferences.

Apart from these obvious choices, you may want to try one of the non-mainstream browsers that specialize in a particular task. For frequent online gamers, 'Coowon' is designed with online gaming in mind, offering multiple windows to log into different game accounts and an option to increase in-game speed. If you are especially concerned about your security and privacy, 'Whitehat Aviator' might be worth checking out: it doesn't collect any private data and opens in incognito mode by default. There are similar browsers that speed up downloads or organize data better. These alternate browsers don't have to replace your primary browser; they can be used alongside it, depending on your requirements.

Choosing the best browser depends largely on your personal preferences. Of the various options available, each has its distinct advantages. Whether you prefer compatibility over speed, or customization over compatibility, you can find a browser that suits your needs. You also have the option of keeping a secondary browser for more specific tasks.

Summary:

A browser presents information from the internet. The big five competitors dominate the market thanks to their speed, compatibility and customizability, among other features. Other options for secondary browsers exist as well. At the end of the day, choosing the best one depends on your preferences.


Introduction to How Internet Search Engines Work

The good news about the Internet and its most visible component, the World Wide Web, is that there are hundreds of millions of pages available, waiting to present information on an amazing variety of topics. The bad news about the Internet is that there are hundreds of millions of pages available, most of them titled according to the whim of their author, almost all of them sitting on servers with cryptic names. When you need to know about a particular subject, how do you know which pages to read? If you're like most people, you visit an Internet search engine.

Internet search engines are special sites on the Web that are designed to help people find information stored on other sites. There are differences in the ways various search engines work, but they all perform three basic tasks:
They search the Internet -- or select pieces of the Internet -- based on important words.
They keep an index of the words they find, and where they find them.
They allow users to look for words or combinations of words found in that index.
Early search engines held an index of a few hundred thousand pages and documents, and received maybe one or two thousand inquiries each day. Today, a top search engine will index hundreds of millions of pages, and respond to tens of millions of queries per day. In this article, we'll tell you how these major tasks are performed, and how Internet search engines put the pieces together in order to let you find the information you need on the Web.


Web Crawling

When most people talk about Internet search engines, they really mean World Wide Web search engines. Before the Web became the most visible part of the Internet, there were already search engines in place to help people find information on the Net. Programs with names like "gopher" and "Archie" kept indexes of files stored on servers connected to the Internet, and dramatically reduced the amount of time required to find programs and documents. In the late 1980s, getting serious value from the Internet meant knowing how to use gopher, Archie, Veronica and the rest.

Today, most Internet users limit their searches to the Web, so we'll limit this article to search engines that focus on the contents of Web pages.

Before a search engine can tell you where a file or document is, it must be found. To find information on the hundreds of millions of Web pages that exist, a search engine employs special software robots, called spiders, to build lists of the words found on Web sites. When a spider is building its lists, the process is called Web crawling. (There are some disadvantages to calling part of the Internet the World Wide Web -- a large set of arachnid-centric names for tools is one of them.) In order to build and maintain a useful list of words, a search engine's spiders have to look at a lot of pages.

How does any spider start its travels over the Web? The usual starting points are lists of heavily used servers and very popular pages. The spider will begin with a popular site, indexing the words on its pages and following every link found within the site. In this way, the spidering system quickly begins to travel, spreading out across the most widely used portions of the Web.
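
To make the idea concrete, here is a toy breadth-first spider in Python. It is only a sketch: the names are illustrative, and a real crawler would also respect robots.txt, rate limits and many other details.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect the href attribute of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    """Fetch pages breadth-first, starting from a list of popular seed URLs."""
    seen, queue, pages = set(), deque(seed_urls), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                           # skip pages that fail to load
        pages[url] = html                      # a real engine would index the words here
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))   # spread out along the links
    return pages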

Google began as an academic search engine. In the paper that describes how the system was built, Sergey Brin and Lawrence Page give an example of how quickly their spiders can work. They built their initial system to use multiple spiders, usually three at one time. Each spider could keep about 300 connections to Web pages open at a time. At its peak performance, using four spiders, their system could crawl over 100 pages per second, generating around 600 kilobytes of data each second.

Keeping everything running quickly meant building a system to feed necessary information to the spiders. The early Google system had a server dedicated to providing URLs to the spiders. Rather than depending on an Internet service provider for the domain name server (DNS) that translates a server's name into an address, Google had its own DNS, in order to keep delays to a minimum.

When the Google spider looked at an HTML page, it took note of two things:
The words within the page
Where the words were found

Words occurring in the title, subtitles, meta tags and other positions of relative importance were noted for special consideration during a subsequent user search. The Google spider was built to index every significant word on a page, leaving out the articles "a," "an" and "the." Other spiders take different approaches.

These different approaches usually attempt to make the spider operate faster, allow users to search more efficiently, or both. For example, some spiders will keep track of the words in the title, sub-headings and links, along with the 100 most frequently used words on the page and each word in the first 20 lines of text. Lycos is said to use this approach to spidering the Web.

Other systems, such as AltaVista, go in the other direction, indexing every single word on a page, including "a," "an," "the" and other "insignificant" words. The push to completeness in this approach is matched by other systems in the attention given to the unseen portion of the Web page, the meta tags. Learn more about meta tags on the next page.

Meta Tags

Meta tags allow the owner of a page to specify key words and concepts under which the page will be indexed. This can be helpful, especially in cases in which the words on the page might have double or triple meanings -- the meta tags can guide the search engine in choosing which of the several possible meanings for these words is correct. There is, however, a danger in over-reliance on meta tags, because a careless or unscrupulous page owner might add meta tags that fit very popular topics but have nothing to do with the actual contents of the page. To protect against this, spiders will correlate meta tags with page content, rejecting the meta tags that don't match the words on the page.
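
A much-simplified version of that correlation check might look like the following; the 50 percent threshold and the function name are assumptions made purely for illustration, not a description of any real engine's rule.

def meta_keywords_look_legitimate(meta_keywords, page_words, min_overlap=0.5):
    """Accept the meta keywords only if enough of them appear in the page text."""
    keywords = {k.strip().lower() for k in meta_keywords if k.strip()}
    body = {w.lower() for w in page_words}
    if not keywords:
        return False
    return len(keywords & body) / len(keywords) >= min_overlap

# Keywords that mostly never appear in the visible text are rejected.
print(meta_keywords_look_legitimate(
    ["flowers", "celebrities", "lottery"],
    "order fresh flowers online for delivery today".split()))   # False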

All of this assumes that the owner of a page actually wants it to be included in the results of a search engine's activities. Many times, the page's owner doesn't want it showing up on a major search engine, or doesn't want the activity of a spider accessing the page. Consider, for example, a game that builds new, active pages each time sections of the page are displayed or new links are followed. If a Web spider accesses one of these pages, and begins following all of the links for new pages, the game could mistake the activity for a high-speed human player and spin out of control. To avoid situations like this, the robot exclusion protocol was developed. This protocol, implemented in the meta-tag section at the beginning of a Web page, tells a spider to leave the page alone -- to neither index the words on the page nor try to follow its links.
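
In an HTML page, that instruction is conventionally expressed with the robots meta tag in the document's head section; for example:

<meta name="robots" content="noindex, nofollow">

Here, noindex asks a spider not to add the page to its index, and nofollow asks it not to follow the page's links.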

Building the Index

Once the spiders have completed the task of finding information on Web pages (and we should note that this is a task that is never actually completed -- the constantly changing nature of the Web means that the spiders are always crawling), the search engine must store the information in a way that makes it useful. There are two key components involved in making the gathered data accessible to users:

The information stored with the data

The method by which the information is indexed

In the simplest case, a search engine could just store the word and the URL where it was found. In reality, this would make for an engine of limited use, since there would be no way of telling whether the word was used in an important or a trivial way on the page, whether the word was used once or many times or whether the page contained links to other pages containing the word. In other words, there would be no way of building the ranking list that tries to present the most useful pages at the top of the list of search results.

To make for more useful results, most search engines store more than just the word and URL. An engine might store the number of times that the word appears on a page. The engine might assign a weight to each entry, with increasing values assigned to words as they appear near the top of the document, in sub-headings, in links, in the meta tags or in the title of the page. Each commercial search engine has a different formula for assigning weight to the words in its index. This is one of the reasons that a search for the same word on different search engines will produce different lists, with the pages presented in different orders.
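
As a rough sketch of that idea, a tiny inverted index might record, for each word, the pages it appears on along with a weight. The weighting below (title words count three, body words count one) is purely illustrative; every real engine uses its own, far more elaborate formula.

from collections import defaultdict

def build_index(pages):
    """pages maps URL -> (title, body). The index maps word -> {URL: weight}."""
    index = defaultdict(dict)
    for url, (title, body) in pages.items():
        for word in title.lower().split():
            index[word][url] = index[word].get(url, 0) + 3   # title words weigh more
        for word in body.lower().split():
            index[word][url] = index[word].get(url, 0) + 1   # each body occurrence adds one
    return index

pages = {
    "http://example.com/a": ("Flower Shop", "best flower shop in los angeles"),
    "http://example.com/b": ("Gardening Tips", "plant a flower bed this spring"),
}
index = build_index(pages)
print(index["flower"])   # both pages, with the shop page weighted higher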

Regardless of the precise combination of additional pieces of information stored by a search engine, the data will be encoded to save storage space. For example, the original Google paper describes using 2 bytes, of 8 bits each, to store information on weighting -- whether the word was capitalized, its font size, position, and other information to help in ranking the hit. Each factor might take up 2 or 3 bits within the 2-byte grouping (8 bits = 1 byte). As a result, a great deal of information can be stored in a very compact form. After the information is compacted, it's ready for indexing.
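
The exact layout Google used is described in the paper itself; the sketch below simply illustrates the general bit-packing idea with a made-up split of the 16 bits (1 bit for capitalization, 3 for font size, 12 for position).

def pack_hit(capitalized, font_size, position):
    """Pack one word occurrence into 2 bytes."""
    assert 0 <= font_size < 8 and 0 <= position < 4096
    value = (int(capitalized) << 15) | (font_size << 12) | position
    return value.to_bytes(2, "big")

def unpack_hit(packed):
    value = int.from_bytes(packed, "big")
    return bool(value >> 15), (value >> 12) & 0b111, value & 0xFFF

packed = pack_hit(True, 5, 42)
print(len(packed), unpack_hit(packed))   # 2 (True, 5, 42)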

An index has a single purpose: It allows information to be found as quickly as possible. There are quite a few ways for an index to be built, but one of the most effective ways is to build a hash table. In hashing, a formula is applied to attach a numerical value to each word. The formula is designed to evenly distribute the entries across a predetermined number of divisions. This numerical distribution is different from the distribution of words across the alphabet, and that is the key to a hash table's effectiveness.

In English, there are some letters that begin many words, while others begin fewer. You'll find, for example, that the "M" section of the dictionary is much thicker than the "X" section. This inequity means that finding a word beginning with a very "popular" letter could take much longer than finding a word that begins with a less popular one. Hashing evens out the difference, and reduces the average time it takes to find an entry. It also separates the index from the actual entry. The hash table contains the hashed number along with a pointer to the actual data, which can be sorted in whichever way allows it to be stored most efficiently. The combination of efficient indexing and effective storage makes it possible to get results quickly, even when the user creates a complicated search.
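
The dictionary example translates directly into code. In the sketch below, most of the words begin with "m", yet a hash function typically scatters them across the buckets instead of piling them into one alphabetical section; the word list and bucket count are arbitrary.

words = ["machine", "mongoose", "market", "museum", "maple", "mirror", "xylophone", "xenon"]

def bucket_for(word, num_buckets=4):
    """Map a word to one of num_buckets divisions. Python's built-in hash is used
    here only for illustration; a real engine would use its own stable hash."""
    return hash(word) % num_buckets

table = {}
for w in words:
    table.setdefault(bucket_for(w), []).append(w)   # bucket -> the entries it points to

for bucket, entries in sorted(table.items()):
    print(bucket, entries)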

Building a Search

Searching through an index involves a user building a query and submitting it through the search engine. The query can be quite simple, a single word at minimum. Building a more complex query requires the use of Boolean operators that allow you to refine and extend the terms of the search.
The Boolean operators most often seen are:
AND - All the terms joined by "AND" must appear in the pages or documents. Some search engines substitute the operator "+" for the word AND.
OR - At least one of the terms joined by "OR" must appear in the pages or documents.
NOT - The term or terms following "NOT" must not appear in the pages or documents. Some search engines substitute the operator "-" for the word NOT.
FOLLOWED BY - One of the terms must be directly followed by the other.
NEAR - One of the terms must be within a specified number of words of the other.
Quotation Marks - The words between the quotation marks are treated as a phrase, and that phrase must be found within the document or file.
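
Under the hood, these operators map naturally onto set operations over an index of which documents contain which words. The Python sketch below uses a made-up three-document collection; FOLLOWED BY, NEAR and phrase matching would additionally require word positions, which are omitted here.

docs = {
    1: "the truck bed was full of flowers",
    2: "plant a flower bed near the fence",
    3: "fish lay their eggs in a gravel bed",
}

postings = {}
for doc_id, text in docs.items():
    for word in text.split():
        postings.setdefault(word, set()).add(doc_id)

def AND(a, b): return postings.get(a, set()) & postings.get(b, set())
def OR(a, b):  return postings.get(a, set()) | postings.get(b, set())
def NOT(a, b): return postings.get(a, set()) - postings.get(b, set())

print(AND("bed", "flowers"))    # {1}
print(OR("flower", "flowers"))  # {1, 2}
print(NOT("bed", "fish"))       # {1, 2}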

Future Search

The searches defined by Boolean operators are literal searches -- the engine looks for the words or phrases exactly as they are entered. This can be a problem when the entered words have multiple meanings. "Bed," for example, can be a place to sleep, a place where flowers are planted, the storage space of a truck or a place where fish lay their eggs. If you're interested in only one of these meanings, you might not want to see pages featuring all of the others. You can build a literal search that tries to eliminate unwanted meanings, but it's nice if the search engine itself can help out.

One of the areas of search engine research is concept-based searching. Some of this research involves using statistical analysis on pages containing the words or phrases you search for, in order to find other pages you might be interested in. Obviously, the information stored about each page is greater for a concept-based search engine, and far more processing is required for each search. Still, many groups are working to improve both results and performance of this type of search engine. Others have moved on to another area of research, called natural-language queries.

The idea behind natural-language queries is that you can type a question the same way you would ask it of a human sitting beside you -- no need to keep track of Boolean operators or complex query structures. The most popular natural-language query site today is AskJeeves.com, which parses the query for keywords that it then applies to the index of sites it has built. It only works with simple queries, but competition is heavy to develop a natural-language query engine that can accept queries of great complexity.

Written By: Curt Franklin

Source:
http://computer.howstuffworks.com/internet/basics/search-engine.htm/printable 
