Source: This article was published on educatorstechnology.com - Contributed by Member: Clara Johnson

When it comes to searching for niche-specific content, Google is not the best option out there. Although Google can be a good starting point from which to delve deeper into the content area you are searching for, you can save much more time by using content-specific search engines. In today’s post, we are sharing some examples of academic search engines that student researchers and teachers can use to search for, find and access scholarly content. We are only featuring the most popular titles, but you can always find other options to add to the list. From Google Scholar to June, these search engines can make a real difference in your academic search. Check them out and share your feedback with us.

Some of The Best Academic Search Engines for Teachers and Student Researchers

Source: This article was published on thefutureofthings.com - Contributed by Member: Issac Avila

Research is the most critical step when writing an academic paper. It’s nearly impossible for students to impress and inspire the assessor with their academic paper if it’s not well-researched. It needs to contain authentic and genuine information for credibility, and that requires a credible source with authoritative reference materials.

While most academic resources can now be easily accessed online, using search engines like Google can be quite frustrating. The reason is that popular search engines like Google, Bing, and Yahoo are full of advertisements and clickbait that can really hamper your effectiveness. And if you’re lucky enough to find some nearly relevant information with the aforementioned search engines, you will notice that it is improperly (or rarely) referenced, poorly formatted and casually presented.

We both know that you can’t get away with citing WikiHow, Hubspot or Wikipedia in your research paper. So what’s next? You need a list of search engines for students which will provide credible and authentic scholarly material for your use and reference – and for that, we’ve got you covered. Below is a list of the top 9 Educational Search Engines for Students that you will find rich in authoritative, accurate and credible information for your academic projects and assignments.

If for some reason you still can’t find what you’re looking for, or you are overloaded with other research papers or essays and still want to deliver high-quality work with credible resources, you may try a custom writing service like www.copycrafter.net/custom-writing. CopyCrafter has qualified and experienced authors who will deliver high-quality custom essays or research papers on pretty much any subject. For now, though, you can try it yourself with the help of the following resources:

1. Google Scholar

Google Scholar is a free, customized academic search engine designed specifically for students, tutors, researchers and anyone interested in academic materials. It’s the most popular research search engine for students, and it lists academic resources across a wide range of sources. It allows students and researchers to find credible information, research papers and journals, and to save them in their personal library.

2. iSEEK Education

iSeek is another widely used and one of the best search engines for students, educators and scholars. It’s a reliable, smart, and safe tool for your academic research and paper writing. Since the search engine was specially designed with students, educators and researchers in mind, you will be able to find credible and relevant resources that will ultimately save you time.

3. Microsoft Academic Research

Most people are familiar with Microsoft products and brands, and there is no denying that the company delivers incredible quality and consistency in its projects. Microsoft Academic Research is no exception; the search engine indexes a wide range of scientific journals and research publications, from engineering and computer science to biology and social science. It has over 47 million publications written by more than 20 million authors. Microsoft Academic Research allows you to search resources based on authors, conferences, and domains.

4. ResearchGate

If you’re a science major, you will love ResearchGate. In fact, chances are you’ve already searched for certain academic topics in Google and ended up on the ResearchGate platform. It’s a networking site for students, researchers, and scientists and provides access to more than 100 million publications and over 15 million researchers. Other than accessing the information, the platform also lets you ask researchers questions.

5. Wolfram Alpha

Wolfram Alpha presents itself as a ‘computational knowledge engine’ that provides results as answers. All you need to do is type in the question or topic you’re interested in, like “What is the diameter of the observable universe?”, and the answer will pop up. The best part is that it doesn’t make you scroll through tens of pages of results. It doesn’t present search results the way the other engines do, but it’s great for students looking for quick, snappy answers to specific questions as they go about their assignments and projects.

6. ScienceDirect

ScienceDirect presents itself as a leading and reliable full-text scientific database that offers access to journal publications, book chapters, and research papers. It’s one of the most popular science search engines for students with more than 20,000 books and over 2,500 journals across various scientific topics and domains. You will be able to access articles, book chapters, peer-reviewed journals and content from topics and subjects like Chemistry, Computer Science, Energy, Earth and Planetary Sciences, Engineering, Materials Science, Physics and Astronomy, Mathematics and so on.

7. RefSeek

RefSeek employs a minimalistic design, which doesn’t look like much at first, but there is a lot going on in the background. It’s probably the most aggressive search engine for students, as it pulls from more than 1 billion journals, research papers, books, encyclopedias and web pages. It works more or less like Google, but it focuses only on academic and scientific results without the distraction of paid links. So you can expect most results to come from .edu and .org sites.

8. Educational Resources Information Center (ERIC)

ERIC is a reliable and informative online digital library that is populated and maintained by the U.S. Department of Education. The platform provides academic and educational resources for educators, students and researchers, with over 1.3 million publications. Students can find materials such as books, research papers, journals, technical reports, policy papers, dissertations, conference papers and so on. The platform receives over eight million searches per month, meaning it’s a reliable and authoritative source of academic and research information.

9. The Virtual Learning Resources Center (Virtual LRC)

Virtual LRC is a search engine for college students that allows them to search and explore educational websites with authoritative and high-quality information. The search engine indexes thousands of scholarly and academic information sites, ensuring that you get the most refined and relevant results. The platform and the results you get have been organized by researchers, library professionals and teachers around the globe to ensure that students can easily find resources for their projects and academic assignments.

Conclusion

The above-named directories and databases are among the most trusted and highly reputable search engines for students to find credible, authoritative and reliable academic resources. They offer information and references on all subject areas including chemistry, biology, physics, business, social science, mathematics, computer and technology and environmental science.

Source: This article was published on hindustantimes.com by Karen Weise and Sarah Frier - Contributed by Member: David J. Redcliff

For scholars, the scale of Facebook’s 2.2 billion users provides an irresistible way to investigate how human nature may play out on, and be shaped by, the social network.

The professor was incredulous. David Craig had been studying the rise of entertainment on social media for several years when a Facebook Inc. employee he didn’t know emailed him last December, asking about his research. “I thought I was being pumped,” Craig said. The company flew him to Menlo Park and offered him $25,000 to fund his ongoing projects, with no obligation to do anything in return. This was definitely not normal, but after checking with his school, the University of Southern California, Craig took the gift. “Hell, yes, it was generous to get an out-of-the-blue offer to support our work, with no strings,” he said. “It’s not all so black and white that they are villains.”

Other academics got these gifts, too. One, who said she had $25,000 deposited in her research account recently without signing a single document, spoke to a reporter hoping maybe the journalist could help explain it. Another professor said one of his former students got an unsolicited monetary offer from Facebook, and he had to assure the recipient it wasn’t a scam. The professor surmised that Facebook uses the gifts as a low-cost way to build connections that could lead to closer collaboration later. He also thinks Facebook “happily lives in the ambiguity” of the unusual arrangement. If researchers truly understood that the funding has no strings, “people would feel less obligated to interact with them,” he said.

The free gifts are just one of the little-known and complicated ways Facebook works with academic researchers. For scholars, the scale of Facebook’s 2.2 billion users provides an irresistible way to investigate how human nature may play out on, and be shaped by, the social network. For Facebook, the motivations to work with outside academics are far thornier, and it’s Facebook that decides who gets access to its data to examine its impact on society. “Just from a business standpoint, people won’t want to be on Facebook if Facebook is not positive for them in their lives,” said Rob Sherman, Facebook’s deputy chief privacy officer. “We also have a broader responsibility to make sure that we’re having the right impact on society.”

The company’s long been conflicted about how to work with social scientists, and now runs several programs, each reflecting the contorted relationship Facebook has with external scrutiny. The collaborations have become even more complicated in the aftermath of the Cambridge Analytica scandal, which was set off by revelations that a professor who once collaborated with Facebook’s in-house researchers used data collected separately to influence elections.

“Historically the focus of our research has been on product development, on doing things that help us understand how people are using Facebook and build improvements to Facebook,” Sherman said. Facebook’s heard more from academics and non-profits recently who say “because of the expertise that we have, and the data that Facebook stores, we have an opportunity to contribute to generalizable knowledge and to answer some of these broader social questions,” he said. “So you’ve seen us begin to invest more heavily in social science research and in answering some of these questions.”

Facebook has a corporate culture that reveres research. The company builds its product based on internal data on user behaviour, surveys and focus groups. More than a hundred Ph.D.-level researchers work on Facebook’s in-house core data science team, and employees say the information that points to growth has had more of an impact on the company’s direction than Chief Executive Officer Mark Zuckerberg’s ideas.

Facebook is far more hesitant to work with outsiders; it risks unflattering findings, leaks of proprietary information, and privacy breaches. But Facebook likes it when external research proves that Facebook is great. And in the fierce talent wars of Silicon Valley, working with professors can make it easier to recruit their students.

It can also improve the bottom line. In 2016, when Facebook changed the “like” button into a set of emojis that better captured user expression (and feelings for advertisers), it did so with the help of Dacher Keltner, a psychology professor at the University of California, Berkeley, who’s an expert in compassion and emotions. Keltner’s Greater Good Science Center continues to work closely with the company. And this January, Facebook made research the centerpiece of a major change to its news feed algorithm. In studies published with academics at several universities, Facebook found that people who used social media actively (commenting on friends’ posts, setting up events) were likely to see a positive impact on mental health, while those who used it passively might feel depressed. In reaction, Facebook declared it would spend more time encouraging “meaningful interaction.” Of course, the more people engage with Facebook, the more data it collects for advertisers.

The company has stopped short of pursuing deeper research on the potentially negative fallout of its power. According to its public database of published research, Facebook’s written more than 180 public papers about artificial intelligence but just one study about elections, based on an experiment Facebook ran on 61 million users to mobilize voters in the Congressional midterms back in 2010. Facebook’s Sherman said, “We’ve certainly been doing a lot of work over the past couple of months, particularly to expand the areas where we’re looking.”

Facebook’s first peer-reviewed papers with outside scholars were published in 2009, and almost a decade into producing academic work, it still wavers over how to structure the arrangements. It’s given out the smaller unrestricted gifts. But those gifts don’t come with access to Facebook’s data, at least initially. The company is more restrictive about who can mine or survey its users. It looks for research projects that dovetail with its business goals.

Some academics cycle through one-year fellowships while pursuing doctorate degrees, and others get paid for consulting projects, which never get published.

When Facebook does provide data to researchers, it retains the right to veto or edit the paper before publication. None of the professors Bloomberg spoke with knew of cases when Facebook prohibited a publication, though many said the arrangement inevitably leads academics to propose investigations less likely to be challenged. “Researchers focus on things that don’t create a moral hazard,” said Dean Eckles, a former Facebook data scientist now at the MIT Sloan School of Management. Without a guaranteed right to publish, Eckles said, researchers inevitably shy away from potentially critical work. That means some of the most burning societal questions may go unprobed.

Facebook also almost always pairs outsiders with in-house researchers. This ensures scholars have a partner who’s intimately familiar with Facebook’s vast data, but some who’ve worked with Facebook say this also creates a selection bias about what gets studied. “Stuff still comes out, but only the immensely positive, happy stories—the goody-goody research that they could show off,” said one social scientist who worked as a researcher at Facebook. For example, he pointed out that the company’s published widely on issues related to well-being, or what makes people feel good and fulfilled, which is positive for Facebook’s public image and product. “The question is: ‘What’s not coming out?,’” he said.

Facebook argues its body of work on well-being does have broad importance. “Because we are a social product that has large distribution within society, it is both about societal issues as well as the product,” said David Ginsberg, Facebook’s director of research. Other social networks have smaller research ambitions, but have tried more open approaches. This spring, Twitter Inc. asked for proposals to measure the health of conversations on its platform, and Microsoft Corp.’s LinkedIn is running a multi-year programme to have researchers use its data to understand how to improve the economic opportunities of workers. Facebook has issued public calls for technical research, but until the past few months, hasn’t done so for social sciences. Yet it has solicited in that area, albeit quietly: Last summer, one scholarly association begged discretion when sharing information on a Facebook pilot project to study tech’s impact in developing economies. Its email read, “Facebook is not widely publicizing the program.”

In 2014, the prestigious Proceedings of the National Academy of Sciences published a massive study, co-authored by two Facebook researchers and an outside academic, that found emotions were “contagious” online, that people who saw sad posts were more likely to make sad posts. The catch: the results came from an experiment run on 689,003 Facebook users, where researchers secretly tweaked the algorithm of Facebook’s news feed to show some cheerier content than others. People were angry, protesting that they didn’t give Facebook permission to manipulate their emotions.

The company first said people allowed such studies by agreeing to its terms of service, and then eventually apologized. While the academic journal didn’t retract the paper, it issued an “Editorial Expression of Concern.”

To get federal research funding, universities must run testing on humans through what’s known as an institutional review board, which includes at least one outside expert, approves the ethics of the study and ensures subjects provide informed consent. Companies don’t have to run research through IRBs. The emotional-contagion study fell through the cracks.

The outcry profoundly changed Facebook’s research operations, creating a review process that was more formal and cautious. It set up a pseudo-IRB of its own, which doesn’t include an outside expert but does have policy and PR staff. Facebook also created a new public database of its published research, which lists more than 470 papers. But that database now has a notable omission—a December 2015 paper two Facebook employees co-wrote with Aleksandr Kogan, the professor at the heart of the Cambridge Analytica scandal. Facebook said it believes the study was inadvertently never posted and is working to ensure other papers aren’t left off in the future.

In March, Gary King, a Harvard University political science professor, met with some Facebook executives about trying to get the company to share more data with academics. It wasn’t the first time he’d made his case, but he left the meeting with no commitment.

A few days later, the Cambridge Analytica scandal broke, and soon Facebook was on the phone with King. Maybe it was time to cooperate, at least to understand what happens in elections. Since then, King and a Stanford University law professor have developed a complicated new structure to give more researchers access to Facebook’s data on the elections and let scholars publish whatever they find. The resulting structure is baroque, involving a new “commission” of scholars Facebook will help pick, an outside academic council that will award research projects, and seven independent U.S. foundations to fund the work. “Negotiating this was kind of like the Arab-Israel peace treaty, but with a lot more partners,” King said.

The new effort, which has yet to propose its first research project, is the most open approach Facebook’s taken yet. “We hope that will be a model that replicates not just within Facebook but across the industry,” Facebook’s Ginsberg said. “It’s a way to make data available for social science research in a way that means that it’s both independent and maintains privacy.” But the new approach will also face an uphill battle to prove its credibility. The new Facebook research project came together under the company’s public relations and policy team, not its research group of PhDs trained in ethics and research design. More than 200 scholars from the Association of Internet Researchers, a global group of interdisciplinary academics, have signed a letter saying the effort is too limited in the questions it’s asking, and also that it risks replicating what sociologists call the “Matthew effect,” where only scholars from elite universities—like Harvard and Stanford—get an inside track.

“Facebook’s new initiative is set up in such a way that it will select projects that address known problems in an area known to be problematic,” the academics wrote. The research effort, the letter said, also won’t let the world—or Facebook, for that matter—get ahead of the next big problem.

Source: This article was published on emergingedtech.com by Katie Alice - Contributed by Member: Bridget Miller

Whether Conducting Academic Research or Purely Scientific Research, These Sites can be an Invaluable Aid.

Researching is the most crucial step in writing a scientific paper. It is always a well-researched scientific paper that inspires the assessor. At the same time, it must have genuine and authentic information for credibility. With the development in the Internet industry, i.e., web resources, researching for scientific materials has now become a matter of a few clicks. Now students can get information on any topic pertaining to science through academic search engines. They provide a centralized platform and allow the students to acquire literature on any topic within seconds.

While there are many academic search engines available, there are some that have the most trusted resources. They provide information on a range of topics from Engineering and technology to Biology and Natural Science. They provide a one-stop solution to all research-related needs for a scientific paper. Besides, they provide a personal and customized way to search research materials on any given topic. This article will focus on some popular academic search engines that have revolutionized the way information is researched by the students. They are rich in information and have the highest level of credibility.

  1. Google Scholar (http://scholar.google.com/): Google Scholar is a free academic search engine that indexes academic information from various online web resources. Google Scholar lists information across an array of academic resources, most of which are peer-reviewed. It works in the same manner as Scirus. Founded in 2004, it is one of the most widely used academic resources for researchers and scholars.
  2. CiteSeerx (http://citeseerx.ist.psu.edu): CiteSeerx is a digital library and an online academic journal that offers information within the field of computer science. It indexes academic resources through an autonomous citation indexing system. This academic database is particularly helpful for students seeking information on computer and information sciences. It offers many other exclusive features to facilitate the research process, including: ACI – Autonomous Citation Indexing, reference linking, citation statistics, automatic metadata extraction and related documents. Founded in 1998, it is the first online academic database and has since evolved into a more dynamic and user-friendly academic search engine.
  3. GetCITED (http://www.getcited.org/): GetCITED is another powerful tool for searching scientific information. It is an online academic database that indexes academic journals and citations. It is a one-stop platform that offers everything related to academic publications, such as chapters, conference papers, reports and presentations. You can even browse through the bibliographies to search for related details. Furthermore, you can find information on any author and his published works. The two ‘most outstanding’ features of this academic search engine are ‘a comprehensive database’ and a ‘discussion forum’. It allows every member of academia to contribute to its database resources. It has over 3,000,000 entries written by more than 300,000 authors.
  4. Microsoft Academic Research (http://academic.research.microsoft.com/): Microsoft Academic Research is yet another top search engine for academic resources. Developed by Microsoft Research, it has more than 48 million publications written by over 20 million authors. It indexes a range of scientific journals, from computer science and engineering to social science and biology. It has brought in many new ways to search academic resources, such as by papers, authors, conferences, and journals. This academic search engine allows you to search for information based on authors or domains.
  5. Bioline International (http://www.bioline.org.br/): Bioline is among the most trusted and authentic search engines, hosting peer-reviewed academic journals on public health, food and nutritional security, food and medicine, and biodiversity. It provides free access to peer-reviewed journals from developing countries. It promotes an exchange of ideas through academic resources. Founded in 1993, it has 70 journals across 15 countries that offer information on subjects like crop science, biodiversity, public health and international development.
  6. Directory of Open Access Journals (http://www.doaj.org/): The Directory of Open Access Journals (DOAJ) is yet another free search engine for scientific and scholarly resources. The directory offers a huge range of topics within scientific areas of study. It is among the richest sources of scholarly material, with over 8,000 journals available on different topics. All the journals are thoroughly peer-reviewed.
  7. PLOS ONE (http://www.plosone.org/): Founded in 2006, PLOS ONE provides a free-access platform to everyone searching for science-related information. All the articles on PLOS ONE are published after going through a strict peer-review process. This academic database has a meticulous procedure for publishing a journal. You can find plenty of articles and academic publications using this platform.
  8. BioOne (http://www.bioone.org/): An excellent search engine for scientific information, BioOne contains academic resources for biological, environmental and ecological sciences. Established in 2000, it started as an NGO and later became an online academic journal directory. The journal gives free access to over 25,000 institutions all over the world.
  9. Science and Technology of Advanced Materials (http://iopscience.iop.org/1468-6996/): First published in 2000, Science and Technology of Advanced Materials went online in 2008. This peer-reviewed academic journal offers free access to articles on major areas of science and technology. The academic directory is totally free of cost and provides easy and simple access to a plethora of information covering scientific subject matter.
  10. New Journal of Physics (http://iopscience.iop.org/1367-2630): New Journal of Physics is an online scientific journal with a searchable academic database centred on physics. It was founded in 1998 by the Institute of Physics and the Deutsche Physikalische Gesellschaft. The journal offers academic articles on diversified topics with physics as the central theme.
  11. ScienceDirect (http://www.sciencedirect.com/): “A leading full-text scientific database offering journal articles and book chapters from more than 2,500 journals and almost 20,000 books.”

The above-mentioned academic databases and directories are among the most trusted search engines for scientific research. They offer information on possibly all the major areas of science, including computer and technology, biology, environmental science and social sciences, and other areas of academic research.

There are many things you will love and hate doing while at the university or college, and research is one of them. You will love it for the new exciting things you discover, but hate it for the tedious hours of digging into thousands of pages both online and at the library. One thing that will for sure save you a lot of stress is having reliable sources and useful websites to dig more effectively.

Every university has its own database that you are encouraged to use under your username. But in order to produce a good piece of work, you must consider as much relevant information as possible. So below is a list of handpicked search engines that can be used for academic purposes.

1. Google Scholar

Back at high school, you used to turn to Google for whatever it is you needed to find. Well, this doesn’t change much at the university or college level, only you will now use Google Scholar. It is the same search engine that limits your results to strictly academic sources, such as legal or nursing documents, scientific articles and so on.

2. iSEEK
iSeek is probably one of the most popular scholarly search engines used by both educators and their students. Its potential in finding relevant information is incredible and you will definitely appreciate the way your results can be filtered by place, subject, author or even school level.

3. Google Book Search
It certainly pays off to go to a library, but unlike at the university library, you can always be 100% sure you will find the book you are looking for using Google Book Search. Use it for finding references to your subject in both the newest and old books, browse the pages that interest you and know exactly where you can find that book if you need to have it.

4. Google Correlate
Google is very cool about coming up with some of the best ways to search for information. Using Google Correlate you can find data that compares (or correlates) with whatever issue you are studying.

5. RefSeek
RefSeek probably has just as many books, articles and newspaper copies as Google does. It filters your results showing specifically the trustworthy academic pages and has no distracting paid links.

6. Microsoft Academic Search

Microsoft offers one useful multifunctional system that you can use not only for typing and formatting your paper but also for browsing an enormous number (over 40 million) of educational resources. I personally love it for the kind of visual content (infographics and charts) you can find there.

7. Online Journal Search Engine
If you know exactly what you are looking for and need a lot of criteria to narrow down your search, then the Online Journal Search Engine should be your choice. It gives you over 60 filter options and can search across hundreds of scientific databases.

8. Open Directory of Open Access Repositories
Yet another custom Google search engine, Open DOAR is wired to search free academic resources. This is perfect for interim student research papers, since paying for scientific publications can really make you go broke.

9. Virtual Learning Resources Center
For those who are simply browsing the topic and want just academic websites to pop up in the results – Virtual LRC is a type of a custom Google search that filters out non-academic pages.

10. Science Direct

Science Direct is a simple scientific search engine that lets you search by keyword and has a place to put your main filters (such as author, journal, year) right on the main page.

11. State Legislative Websites Directory
Lawyers will appreciate the possibility to browse all 50 State legislative websites in one place.

12. National Archives
National archives can have a lot of first-hand information that will be useful to you, such as original documents, bills, historic artifacts and so on. This is a place to browse all of them at the same time.

13. Archives Hub
A free search engine that digs through public British archives from Scotland, Wales and England.

14. Library of Congress
It is hard to think of a more trustworthy resource than the electronic archive of the biggest library in the world. With the digital era penetrating into everything, they now allow you to search books, historical photos and much more.

15. National Agricultural Library

The U.S. Department of Agriculture has come up with an extensive website where you can find a lot of trustworthy sources on agricultural topics.

16. Smithsonian Institution Research Information System
Have you ever been to the Smithsonian museums in Washington? Now you can have access to all that information online – and that is almost 8 million records!

17. Pixsy
Use this engine for searching visual files. You can even find YouTube videos there, as well as search for suitable images for your text by just drawing a sketch.

18. Cornell University Archive
Many universities have their own databases, but very few make them available to anyone who’s looking. Cornell University gives open access to all its digital academic resources – it would be a shame to miss out on this chance.

19. U.S. Government Publications Catalog
A lot of teachers appreciate links to publications in your works, so here is another resource that allows you to find a lot of things in all the American Government publications, whether they are recent or have been published years ago.

20. UCR Library

One of the best online databases for scientific publications, books and other scholarly resources.

21. The British Library
While the United States has the biggest library, the Brits have good ones to brag about as well. And professors love it when you use foreign resources. So check out the British Library for a different kind of perspective on your topic.

22. WolframAlpha
The future of smart search might be here: Wolfram doesn’t just list matching results, it analyses them for you and answers the questions or tasks you set.

23. Digital Library of the Commons
This is a great source of free international literature, be it dissertations, books or papers.

24. OAIster
Not all the valuable information is written in black and white. You can gain a lot of quality insights and explanations from listening to the millions of records related to your topic. And the best part is that you can listen to them while having lunch or dinner – talk about time saving!

25. Directory of Open Access Journals
Another search engine that is programmed to dig through thousands of scientific and scholarly resources. They also have a strong focus on the peer-reviewed publications and resources.

Source: This article was published on quertime.com by Kelly J. Harris

As Google Scholar approaches its 10th anniversary, Nature spoke to its co-creator Anurag Acharya

Google Scholar, the free search engine for scholarly literature, turns ten years old on November 18. By 'crawling' over the text of millions of academic papers, including those behind publishers' paywalls, it has transformed the way that researchers consult the literature online. In a Nature survey this year, some 60% of scientists said that they use the service regularly. Nature spoke with Anurag Acharya, who co-created the service and still runs it, about Google Scholar's history and what he sees for its future.

How do you know what literature to index?

'Scholarly' is what everybody else in the scholarly field considers scholarly. It sounds like a recursive definition but it does settle down. We crawl the whole web, and for a new blog, for example, you see what the connections are to the rest of scholarship that you already know about. If many people cite it, or if it cites many people, it is probably scholarly. There is no one magic formula: you bring evidence to bear from many features.
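
As a rough illustration of the kind of evidence-weighing described here (a toy sketch only, with invented names, data and threshold, not Google's actual method), a citation-link heuristic might look like this in Python:

    # Toy heuristic: a document is "probably scholarly" if it has enough
    # citation links to and from documents already known to be scholarly.
    # All names, data and the threshold below are invented for illustration.
    def scholarly_score(doc, cites, cited_by, known_scholarly):
        """Count citation links between `doc` and the known scholarly corpus."""
        outgoing = sum(1 for t in cites.get(doc, ()) if t in known_scholarly)
        incoming = sum(1 for s in cited_by.get(doc, ()) if s in known_scholarly)
        return outgoing + incoming

    def looks_scholarly(doc, cites, cited_by, known_scholarly, threshold=5):
        # A real system would combine many features; this uses a single signal.
        return scholarly_score(doc, cites, cited_by, known_scholarly) >= threshold

    # A new preprint that cites several known papers and is cited by two more:
    cites = {"new-preprint": ["paper-a", "paper-b", "paper-c", "blog-post"]}
    cited_by = {"new-preprint": ["paper-d", "paper-e"]}
    known = {"paper-a", "paper-b", "paper-c", "paper-d", "paper-e"}
    print(looks_scholarly("new-preprint", cites, cited_by, known))  # True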

Where did the idea for Google Scholar come from?

I came to Google in 2000, as a year off from my academic job at the University of California, Santa Barbara. It was pretty clear that I was unlikely to have a larger impact [in academia] than at Google — making it possible for people everywhere to be able to find information. So I gave up on academia and ran Google’s web-indexing team for four years. It was a very hectic time, and basically, I burnt out.

Alex Verstak [Acharya’s colleague on the web-indexing team] and I decided to take a six-month sabbatical to try to make finding scholarly articles easier and faster. The idea wasn’t to produce Google Scholar, it was to improve our ranking of scholarly documents in web search. But the problem with trying to do that is figuring out the intent of the searcher. Do they want scholarly results or are they a layperson? We said, “Suppose you didn’t have to solve that hard a problem; suppose you knew the searcher had a scholarly intent.” We built an internal prototype, and people said: “Hey, this is good by itself. You don’t have to solve another problem — let’s go!” Then Scholar clearly seemed to be very useful and very important, so I ended up staying with it.

Was it an instant success?

It was very popular. Once we launched it, usage grew exponentially. One big difference was that we were relevance-ranking [sorting results by relevance to the user’s request], which scholarly search services had not done previously. They were reverse-chronological [providing the newest results first]. And we crawled the full text of research articles, though we did not include the full text from all the publishers when we started.
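
The contrast between relevance ranking and reverse-chronological ordering can be seen in a minimal sketch (hypothetical records and scores, purely illustrative):

    # Illustrative only: the same three results ordered two different ways.
    from datetime import date

    results = [
        {"title": "Deep learning survey", "published": date(2014, 9, 1), "relevance": 0.95},
        {"title": "Unrelated note", "published": date(2014, 11, 1), "relevance": 0.10},
        {"title": "Neural nets tutorial", "published": date(2009, 3, 1), "relevance": 0.80},
    ]

    # Reverse-chronological: newest first, however weakly it matches the query.
    by_date = sorted(results, key=lambda r: r["published"], reverse=True)

    # Relevance ranking: best match to the query first, regardless of age.
    by_relevance = sorted(results, key=lambda r: r["relevance"], reverse=True)

    print([r["title"] for r in by_date])       # 'Unrelated note' comes first
    print([r["title"] for r in by_relevance])  # 'Deep learning survey' comes first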

It took years in some cases to convince publishers to let you crawl their full text. Was that hard?

It depends. You have to think back to a decade ago, when web search was considered lightweight — what people would use to find pictures of Britney Spears, not scholarly articles. But we knew people were sending us purely academic queries. We just had to persuade publishers that our service would be used and would bring them more traffic. We were working with many of them already before Google Scholar launched, of course.

In 2012 Google Scholar was removed from the drop-down menu of search options on Google’s home page. Do you worry that Google Scholar might be downgraded or killed?

No. Our team is continually growing, from two people at the start to nine now. People may have treated that menu removal as a demotion, but it wasn’t really. Those menu links are to help users get from the home page to another service, so they emphasize the most-used transitions. If users already know to start with Google Scholar, they don’t need that transition. That’s all it was.

How does Google Scholar make money?

Google Scholar does not currently make money. There are many Google services that do not make a significant amount of money. The primary role of Scholar is to give back to the research community, and we are able to do so because it is not very expensive, from Google’s point of view. In terms of volume of queries, Google Scholar is small compared to many Google services, so opportunities for advertisement monetization are relatively small. There’s not been pressure to monetize. The benefits that Scholar provides, given the number of people who are working on it, are very significant. People like it internally — we are all, in part, ex-academics.

How many queries does Google Scholar get every day, and how much literature does the service track? (Estimates place it anywhere from 100 million to 160 million scholarly items).

I’m unable to tell you, beyond a very, very large number. The same answer for the literature, except that the number of items indexed has grown about an order of magnitude since we launched. A lot of people wonder about the size. But this kind of discussion is not useful — it’s just 'bike-shedding'. Our challenge is to see how often people are able to find the articles they need. The index size might be a concern here if it was too small. But we are clearly large enough.

Google Scholar has introduced extra services: author profile pages and a recommendations engine, for instance. Is this changing it from a search engine to something closer to a bibliometrics tool?

Yes and no. A significant purpose of profiles is to help you to find the articles you need. Often you don’t remember exactly how to find an article, but you might pivot from a paper you do remember to an author and to their other papers. And you can follow other people’s work — another crucial way of finding articles. Profiles have other uses, of course. Once we know your papers, we can track how your discipline has evolved over time, the other people in the scholarly world that you are linked to, and can even recommend other topics that people in your field are interested in. This helps the recommendations engine, which is a step beyond [a search engine].

Are you worried about the practice known as gaming — people creating fake papers, getting them indexed by Google, and gaining fake citations?

Not really. Yes, you can add any papers you want. But everything is completely visible — articles in your profiles, articles citing yours, where they are hosted, and so on. Anyone in the world can call you on it, basically killing your career. We don’t see spam for that very reason. I have a lot of experience dealing with spam because I used to work on web search. Spam is easier when people are anonymous. If I am trying to build a publication history for my public reputation, I will be relatively cautious. 

What features would you like to see in the future?

We are very good at helping people to find the articles they are looking for and can describe. But the next big thing we would like to do is to get you the articles that you need, but that you don’t know to search for. Can we make serendipity easier? How can we help everyone to operate at the research frontier without them having to scan over hundreds of papers — a very inefficient way of finding things — and do nothing else all day long?

I don’t know how we will make this happen. We have some initial efforts on this (such as the recommendations engine), but it is far from what it needs to be. There is an inherent problem to giving you information that you weren’t actively searching for. It has to be relevant — so that we are not wasting your time — but not too relevant, because you already know about those articles. And it has to avoid short-term interests that come and go: you look up something but you don’t want to get spammed about it for the rest of your life. I don’t think getting our users to ‘train’ a recommendations model will work — that is too much effort.

(For more on recommendation services, see 'How to tame the flood of literature', in Nature's Toolbox section.)

What about helping people search directly for scientific data, not papers?

That is an interesting idea. It is feasible to crawl over data buried inside paywalled papers, as we do with full text. But then if we link the user to the paywalled article, they don’t see this data — just the paper’s abstract. For indexing full-text articles, we depend on that abstract to let users estimate the probable utility of the article. For data we don't have anything similar. So as a field of scholarly communication, we haven’t yet developed a model that would allow for a useful data-search service.

Many people would like to have an API (Application Programming Interface) in Google Scholar, so that they could write programs that automatically make searches or retrieve profile information, and build services on top of the tool. Is that possible?

I can’t do that. Our indexing arrangements with publishers preclude it. We are allowed to scan all the articles, but not to distribute this information to others in bulk. It is important to be able to work with publishers so we can continue to build a comprehensive search service that is free to everybody. That is our primary function, and everything else is in addition to this.

Do you see yourself working at Google Scholar for the next decade?

I didn’t expect to work on Google Scholar for ten years in the first place! My wife reminds me it was supposed to be five, then seven years — and now I’m still not leaving. But this is the most important thing I know I can do. We are basically making the smartest people on the planet more effective. That’s a very attractive proposition, and I don’t foresee moving away from Google Scholar any time soon, or any time easily.

Does your desire for a free, effective search engine go back to your time as a student at the Indian Institute of Technology Kharagpur?

It influenced the problems that appealed to me. For example, there is no other service that indexes the full texts of papers even when the user can see only the abstract. The reason I thought this was an important direction to go in was that I realised users needed to know the information was there. If you know the information is in a paywalled paper, and it is important to you, you will find a way in: you can write to the author, for instance. I did that in Kharagpur — it was really ineffective and slow! So my experiences informed the approach I took. But at this point, Google Scholar has a life of its own.

Should people who use Google Scholar have concerns about data privacy?

We use the standard Google data-collection policies — there is nothing different for Scholar. My role at Google is focused on Google Scholar. So I am not going to be able to say more about broader issues.

Source: This article was published on scientificamerican.com by Richard Van Noorden

Academics say they have been forced to leave the country to pursue their research interests as British universities are accused of blocking studies over fears of a backlash on social media.

As they come under increasing attack from online activists, some of the country’s leading academics have accused universities of putting their reputations before their responsibility to defend academic freedom.

Speaking to The Sunday Telegraph, they claim that university ethics committees are now “drifting into moral vanity” by vetoing research in areas that are seen as “politically incorrect”.

Their comments come amid widespread concern for free speech on campuses, with the Government urging universities to do more to counter the rise of so-called safe spaces and “no-platforming”.

James Caspian, who has been banned by a university from doing transgender research. CREDIT: GEOFF PUGH FOR THE TELEGRAPH

The academics have decided to speak out as James Caspian, one of the country’s leading gender specialists, revealed that he is planning to take Bath Spa University to judicial review over its decision to turn down his research into transgenderism.

A professor who recently left a prestigious Russell Group institution to work in Italy said that while safeguards were needed to ensure research was conducted ethically, some universities now appeared to be “covering their own arses”.

“I’ve certainly heard and known of ethics committees voicing concerns about parts of research that would to most of us seem ridiculous. I think they sometimes go too far.

“In general I’m supportive of ethics committees, but there is room for discussion on their criteria. Attracting a lot of unwanted attention on social media... most researchers would not consider that relevant.

“That’s a matter for the PR office, not an ethics committee.”

Prof Sheila Jeffreys CREDIT: THE AGE/SIMON SCHLUTER

Dr. Heather Brunskell-Evans, a fellow of King’s College London who has previously sat on research awarding bodies, claimed that some universities were becoming “authoritarian”.

“Universities project themselves as places of open debate, while at the same time they are very worried about being seen to fall foul of the consensus,” she added.

“They are increasingly managerial and bureaucratic. They are now prioritizing the risk of reputational damage over their duty to uphold freedom of inquiry.”

Dr. Brunskell-Evans said she has encountered resistance when researching the dangers associated with prostitution, adding that many universities had “shut down” any critical analysis of the subject which might offend advocates in favor of legalisation.

Whilst working at the University of Leicester, she claimed that a critical analysis she published of Vanity Fair magazine’s visual representation of the transgendering of Bruce to Caitlyn Jenner had been pulled after complaints were made.

It was later republished after the university’s lawyers were consulted. The University of Leicester was unavailable for comment.

Others said research decisions are increasingly based on how much money could be generated through research grants, meaning “trendy” and “fashionable” subjects were being prioritised over controversial topics.

“The work done by myself and others would not happen today. University now is about only speaking views which attract funding,” said Prof Sheila Jeffreys, a British feminist and former political scientist at the University of Melbourne.

“I was offered the job in Melbourne because they wanted someone specifically to teach this stuff. It would have been difficult to get back [into a British university]. I suspect that even if I wanted to take up a fellowship I would struggle.”

Dr. Werner Kierski, a psychotherapist who has taught at Anglia Ruskin and Middlesex, added: “They [ethics committees] have become hysterical. If it’s not blocking research, it’s putting limits on what researchers can do.

“In one case, I had an ethics committee force my researchers to text me before and after interviewing people, to confirm that they are still alive.

“It’s completely unnecessary and deeply patronising.

“We’ve reached a point where research conducted in other countries will become increasingly dominant. UK research will become insignificant because they [researchers] are so stifled by ethics requirements.”

Bath Spa University caused controversy earlier this year when it emerged that it had declined Mr. Caspian’s research proposal to examine why growing numbers of transgender people were reversing their transition surgery.

After accepting his proposal in 2015, the university later U-turned when Mr. Caspian asked to look for participants on online forums, informing him that his research could provoke “unnecessary offence” and “attacks on social media”.

Jo Johnson, Universities Minister CREDIT: GETTY

Bath Spa has since offered to refund a third of Mr. Caspian’s fees but has rejected his request for an internal review.

A university spokesman said it would “not be commenting further at this stage”.

Mr. Caspian is now crowdfunding online in order to fight the case and has received almost £6,000 in donations from fellow academics and trans people who support his work.

In a letter sent this week to the universities minister Jo Johnson, Mr. Caspian writes that the “suppression of research on spurious grounds” is a growing problem in Britain.

“I have already heard of academics leaving the UK for countries where they felt they would be more welcomed to carry out their research,” the letter continues.

“I believe that it should be made clear that any infringement of our academic freedom should not be allowed. I would ask you to consider the ramifications should academics continue to be censored in this way.”

Last night, Mr. Johnson said that academic freedom was the “foundation of higher education”, adding that he expected universities to “protect and promote it”.

Under the new Higher Education and Research Act, he said that universities would be expected to champion “the freedom to put forward new ideas and controversial or unpopular opinions”.

A spokesman for Universities UK said that its members had “robust processes” to ensure that all research was conducted appropriately.

“They also recognise that there may be legitimate academic reasons to study matters which may be controversial in nature,” they added.

Source: This article was published on telegraph.co.uk by Harry Yorke

The library has a new online search tool called WorldShare Management platform that is the “Google” of academic search engines, library Director Johnathan Wilson said in an interview. 

Students can get a lot of information through search engines such as Bing, Google, and Discovery. But the WorldShare Management platform provides higher-quality searches for research papers and writing assignments, he said Nov. 8.

Before the platform’s implementation in June, students conducting research projects used multiple, independent databases to search for information, Wilson said.

This process took more time, and students could miss information by not searching each database. The system produced limited information because it only searched through this college’s library’s content, he said.   

Wilson said the WorldShare Management platform is fast and dynamic.

He said the system is owned by the Online Computer Library Center, a company that is the biggest name in the library world and manages the Library of Congress and other giant collections.

The system provides students with a diverse collection of materials by accessing shared resources from libraries managed by the Online Computer Library Center. This allows the system to integrate with interlibrary loan systems to make available resources that otherwise would not be at this college’s library.

Students can call the library staff to have the resource delivered to this library, Wilson said.   

He said students receive search results from a discovery tool that isn’t a limited catalog structure like the previous system. Students receive diverse results from printed books, e-books, journal articles, images, repository search items and other resources. He estimates the total number of resources available as in the “hundreds of thousands.”

“I encourage students doing a research project to try the new library system,” Wilson said. “This tool will allow them to find things they wouldn’t find in the normal route of doing research.”  

He said students can access the WorldShare Management platform by going to the library homepage on the college website and selecting “library discovery your all-in-one-search.”

Technology Services Librarian Lee LeBlanc said the five Alamo Colleges libraries are using the same research platform and paying for it out of their own budgets.

The implementation cost for the new library services platform was $131,891 for all five Alamo Colleges libraries, LeBlanc said.

The five Alamo Colleges libraries will pay a total of $138,425 to maintain the system during the 2017-18 academic year, he said.

 The library staff has received positive feedback for the convenience that the all-in-one search engine provides, Wilson said.   

Students can get assistance using the new platform by asking any member of the library staff or by calling 210-486-1084.

Source: This article was published on theranger.org by Tania Flores

The academic world is supposed to be a bright-lit landscape of independent research pushing back the frontiers of knowledge to benefit humanity.

Years of fingernail-flicking test tubes have paid off by finding the elixir of life. Now comes the hard stuff: telling the world through a respected international journal staffed by sceptics.

After drafting and deleting, adding and revising, the precious discovery has to undergo the ritual of peer-reviews. Only then may your wisdom arouse gasps of envy and nods of respect in the world’s labs and lecture theatres.

The goal is to score hits on the international SCOPUS database (69 million records, 36,000 titles – and rising as you read) of peer-reviewed journals. If the paper is much cited, the author’s CV and job prospects should glow.

SCOPUS is run by Dutch publisher Elsevier for profit.

It’s a tough track up the academic mountain; surely there are easier paths paved by publishers keen to help?

Indeed – but beware. The 148-year-old British multidisciplinary weekly Nature calls them “predatory journals” luring naive young graduates desperate for recognition.

‘Careful checking’

“These journals say: ‘Give us your money and we’ll publish your paper’,” says Professor David Robie of New Zealand’s Auckland University of Technology. “They’ve eroded the trust and credibility of the established journals. Although easily picked by careful checking, new academics should still be wary.”

Shams have been exposed by getting journals to print gobbledygook papers by fictitious authors. One famous sting reported by Nature had a Dr. Anna O Szust being offered journal space if she paid. “Oszust” is Polish for “a fraud”.

Dr. Robie heads AUT’s Pacific Media Centre, which publishes the Pacific Journalism Review, now in its 23rd year. During November he was at Gadjah Mada University (UGM) in Yogyakarta, Central Java, helping his Indonesian colleagues boost their skills and lift their university’s reputation.

The quality of Indonesian learning at all levels is embarrassingly poor for a nation of 260 million spending 20 percent of its budget on education.

The international ranking systems are a dog’s breakfast, but only UGM, the University of Indonesia and the Bandung Institute of Technology make the tail end of the Times Higher Education world top 1,000.

There are around 3500 “universities” in Indonesia; most are private. UGM is public.

UGM has been trying to better itself by sending staff to Auckland, New Zealand, and Munich, Germany, to look at vocational education and master new teaching strategies.

Investigative journalism

Dr. Robie was invited to Yogyakarta through the World Class Professor (WCP) programme, an Indonesian government initiative to raise standards by learning from the best.

Dr. Robie lectured on “developing investigative journalism in the post-truth era,” researching marine disasters and climate change. He also ran workshops on managing international journals.

During a break at UGM, he told Strategic Review that open access – meaning no charges made to authors and readers – was a tool to break the user-pays model.

AUT is one of several universities to start bucking the international trend to corral knowledge and muster millions. The big publishers reportedly make up to 40 percent profit – much of it from library subscriptions.

Photo caption: Pacific Journalism Review’s Dr. David Robie being presented with a model of Universitas Gadjah Mada’s historic main building for the Pacific Media Centre at the editors’ workshop in Yogyakarta, Indonesia.

According to a report by AUT digital librarians Luqman Hayes and Shari Hearne, there are now more than 100,000 scholarly journals in the world put out by 3000 publishers; the number is rocketing so fast library budgets have been swept away in the slipstream.

In 2016, Hayes and his colleagues established Tuwhera (Māori for “be open”) to help graduates and academics liberate their work by hosting accredited and refereed journals at no cost.

The service includes training on editing, presentation and creating websites, which look modern and appealing. Tuwhera is now being offered to UGM – but Indonesian universities have to lift their game.

Language an issue

The issue is language, and it’s a problem, according to Dr. Vissia Ita Yulianto, a researcher at UGM’s Southeast Asian Social Studies Centre (CESASS) and a co-editor of the IKAT research journal. Educated in Germany, she has been working with Dr. Robie to develop journals and ensure they are of top quality.

“We have very intelligent scholars in Indonesia but they may not be able to always meet the presentation levels required,” she said.

“In the future, I hope we’ll be able to publish in Indonesian; I wish it wasn’t so, but right now we ask for papers in English.”

Bahasa Indonesia, originally trade Malay, is the official language. It was introduced to unify the archipelagic nation with more than 300 indigenous tongues. Outside Indonesia and Malaysia it is rarely heard.

English is widely taught, although not always well. Adrian Vickers, professor of Southeast Asian Studies at Sydney University, has written that “the low standard of English remains one of the biggest barriers against Indonesia being internationally competitive.

“… in academia, few lecturers, let alone students, can communicate effectively in English, meaning that writing of books and journal articles for international audiences is almost impossible.”

Though the commercial publishers still dominate, there are now almost 10,000 open-access peer-reviewed journals on the internet.

“Tuwhera has enhanced global access to specialist research in ways that could not previously have happened,” says Dr Robie. “We can also learn much from Indonesia and one of the best ways is through exchange programmes.”

This article was first published in Strategic Review and is republished with the author Duncan Graham’s permission. Graham blogs at indonesianow.blogspot.co.nz

Categorized in How to

Over the past half-decade I’ve written extensively about web archiving, including why we need to understand what’s in our massive archives of the web, whether our archives are failing to capture the modern and social web, the need for archives to modernize their technology infrastructures and, perhaps most intriguingly for the world of “big data,” how archives can make their petabytes of holdings available for research. What might it look like if the world’s web archives opened up their collections for academic research, making hundreds of billions of web objects totaling tens of petabytes and stretching back to the founding of the modern web available as a massive shared corpus to power the modern data mining revolution, from studies of the evolution of the web to powering the vast training corpuses required to build today’s cutting edge neural networks?

When it comes to crawling the open web to build large corpuses for data mining, universities in the US and Canada have largely adopted a hands-off approach, exempting most work from ethical review, granting permission to ignore terms of use or copyright restrictions and waiving traditional policies on data management and replication, on the grounds that material harvested from the open web is publicly accessible information and that its copyright owners, by making it available on the web without password protection, encourage its access and use.

On the other hand, the world’s non-profit and governmental web archives, which collectively hold tens of petabytes of archived content crawled from the open web stretching back 20+ years, have as a whole largely resisted opening their collections to bulk academic research. Many provide no access at all to their collections, some provide access only on a case-by-case basis and others provide access to a single page at a time, with no facilities for bulk-exporting large portions of their holdings or even analyzing them in situ.

While some archives have cited technical limitations in making their content more accessible, the most common argument against offering bulk data mining access revolves around copyright law and the concern that by boxing up gigabytes, terabytes or even petabytes of web content and shipping it to researchers, web archives could be viewed as “redistributing” copyrighted content. Given the growing interest among large content holders in licensing their material for precisely such bulk data mining efforts, some archives have expressed concern that the traditional application of “fair use” doctrine in potentially permitting such data mining access may be gradually eroding.

Thus, paradoxically, research universities have largely adopted the stance that researchers are free to crawl the web and bulk download vast quantities of content to use in their data mining research, while web archives as a whole have adopted the stance that they cannot make their holdings available for data mining because they would, in their view, be “redistributing” the content they downloaded to third parties to use for data mining.

One large web archive has bucked this trend and stood alone among its peers: Common Crawl. Similar to other large web archiving initiatives like the Internet Archive, Common Crawl conducts regular web wide crawls of the open web and preserves all of the content it downloads in the standard WARC file format. Unlike many other archives, it focuses primarily on preserving HTML web pages and does not archive images, videos, JavaScript files, CSS stylesheets, etc. Its goal is not to preserve the exact look and feel of a website on a given snapshot in time, but rather to collect a vast cross section of HTML web pages from across the web in a single place to enable large-scale data mining at web scale.

Yet, what makes Common Crawl so unique is that it makes everything it crawls freely available for download for research. Each month it conducts an open web crawl, boxes up all of the HTML pages it downloads and makes a set of WARC files and a few derivative file formats available for download.

Its most recent crawl, covering August 2017, contains more than 3.28 billion pages totaling 280TiB, while the previous month’s crawl contains 3.16 billion pages and 260TiB of content. The total collection thus totals tens of billions of pages dating back years and totaling more than a petabyte, with all of it instantly available for download to support an incredible diversity of web research.
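
To make this concrete, here is a minimal sketch, assuming the open-source Python libraries requests and warcio, of how a researcher might stream one of these monthly WARC files and count the captured pages. The WARC URL below is a placeholder; real paths are listed in each crawl’s warc.paths manifest.

import requests
from warcio.archiveiterator import ArchiveIterator

# Placeholder: copy a real .warc.gz path from the crawl's warc.paths manifest.
WARC_URL = "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-34/REPLACE-ME.warc.gz"

def count_response_records(warc_url, limit=10000):
    """Stream a gzipped WARC file and count 'response' records, i.e. captured pages."""
    resp = requests.get(warc_url, stream=True)
    resp.raise_for_status()
    count = 0
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "response":
            count += 1
            if count >= limit:  # stop early; a single file holds many thousands of records
                break
    return count

if __name__ == "__main__":
    print(count_response_records(WARC_URL))

The point of the sketch is simply that the unit of access is a bulk archive file streamed by a program, not an individual page rendered in a browser.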

Of course, without the images, CSS stylesheets, JavaScript files and other non-HTML content saved by preservation-focused web archives like the Internet Archive, this vast compilation of web pages cannot be used to reproduce a page’s appearance as it stood on a given point in time. Instead, it is primarily useful for large-scale data mining research, exploring questions like the linking structure of the web or analyzing the textual content of pages, rather than acting as a historical replay service.

The project excludes sites which have robots.txt exclusion policies, following the historical policy of many other web archives, though it is worth noting that the Internet Archive earlier this year began slowly phasing out its reliance on such files due to their detrimental effect on preservation completeness. Common Crawl also allows sites to request removal from their index. Other than these cases, Common Crawl attempts to crawl as much of the remaining web as possible, aiming for a representative sample of the open web.
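
For illustration only (this is not Common Crawl’s actual crawler code), the short standard-library Python sketch below shows the kind of robots.txt check such an opt-out policy implies: before fetching a page, a polite crawler asks whether the site’s robots.txt permits it. The domain and user-agent string are hypothetical.

from urllib import robotparser

# Hypothetical site and user agent; any crawler honoring robots.txt does a check like this.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("ExampleResearchBot", "https://example.com/some/page.html"):
    print("allowed to crawl this page")
else:
    print("excluded by robots.txt; skip this site")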

Moreover, Common Crawl has made its data publicly available for more than half a decade and has become a staple of large academic studies of the web with high visibility in the research community, suggesting that its approach to copyright compliance and research access appears to be working for it.

Yet, beyond its summary and full terms of use documents, the project has published little about how it views its work fitting into US and international standards on copyright and fair use, so I reached out to Sara Crouse, Director of Common Crawl, to ask how the project approaches copyright and fair use and what advice it might have for other web archives considering broadening access to their holdings for academic big data research.

Ms. Crouse noted the risk-averse nature of the web archiving community as a whole (historically many adhered, and still adhere, to a strict “opt in” policy requiring prior approval before crawling a site) and the unwillingness of many archives to modernize their thinking on copyright and to engage more closely with the legal community in ways that could help them expand fair use horizons. In particular, she noted “since we [in the US] are beholden to the Copyright Act, while living in a digital age, many well-intentioned organizations devoted to web science, archiving, and information provision may benefit from a stronger understanding of how copyright is interpreted in present day, and its hard boundaries” and that “many talented legal advisers and groups are interested in the precedent-setting nature of this topic; some are willing to work Pro Bono.”

Given that US universities as a whole have moved aggressively towards this idea of expanding the boundaries of fair use and permitting opt-out bulk crawling of the web to compile research datasets, Common Crawl seems to be in good company when it comes to interpreting fair use for the digital age and modern views on utilizing the web for research.

Returning to the difference between Common Crawl’s datasets and traditional preservation-focused web archiving, Ms. Crouse emphasized that they capture only HTML pages and exclude multimedia content like images, video and other dynamic content.

She noted that a key aspect of their approach to fair use is that web pages are intended for consumption by human beings one at a time using a web browser, while Common Crawl concatenates billions of pages together in the specialized WARC file format designed for machine data mining. Specifically, “Common Crawl does not offer separate/individual web pages for easy consumption. The three data formats that are provided include text, metadata, and raw data, and the data is concatenated” and “the format of the output is not a downloaded web page. The output is in WARC file format which contains the components of a page that are beneficial to machine-level analysis and make for space-efficient archiving (essentially: header, text, and some metadata).”

In the eyes of Common Crawl, the use of specialized archival-oriented file formats like WARC (which is the format of choice of most web archives) limit the content’s use to transformative purposes like data mining and, combined with the lack of capture of styling, image and other visual content, renders the captured pages unsuitable to human browsing, transforming them from their originally intended purpose of human consumption.

As Ms. Crouse put it, “this is big data intended for machine learning/readability. Further, our intention for its use is for public benefit i.e. to encourage research and innovation, not direct consumption.” She noted that “from the layperson’s perspective, it is not at all trivial at present to extract a specific website’s content (that is, text) from a Common Crawl dataset. This task generally requires one to know how to install and run a Hadoop cluster, among other things. This is not structured data. Further it is likely that not all pages of that website will be included (depending on the parameters for depth set for the specific crawl).” This means that “the bulk of [Common Crawl’s] users are from the noncommercial, educational, and research sectors. At a higher level, it’s important to note that we provide a broad and representative sample of the web, in the form of web crawl data, each month. No one really knows how big the web is, and at present, we limit our monthly data publication to approximately 3 billion pages.”
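
To make the preceding point concrete, the hedged sketch below shows roughly how a researcher would first have to query the public per-crawl index that Common Crawl exposes at index.commoncrawl.org just to locate which WARC files, byte offsets and lengths hold a given domain’s captures, before any extraction could begin. The field names follow the index’s JSON output, “CC-MAIN-2017-34” is the label of the August 2017 crawl mentioned earlier, and the domain is a stand-in.

import json
import requests

# Index endpoint for one monthly crawl (CC-MAIN-2017-34 corresponds to August 2017).
INDEX_URL = "https://index.commoncrawl.org/CC-MAIN-2017-34-index"

def lookup_captures(domain, max_results=10):
    """Return (page URL, WARC filename, byte offset, record length) for captures of a domain."""
    params = {"url": domain + "/*", "output": "json", "limit": max_results}
    resp = requests.get(INDEX_URL, params=params)
    resp.raise_for_status()
    captures = []
    for line in resp.text.splitlines():
        rec = json.loads(line)
        captures.append((rec["url"], rec["filename"], int(rec["offset"]), int(rec["length"])))
    return captures

for capture in lookup_captures("example.com"):
    print(capture)

Even after such a lookup, each record still has to be fetched from the right WARC file by byte range and parsed, which is exactly the kind of machine-level workflow Ms. Crouse describes.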

Of course, given that content owners are increasingly looking to bulk data mining access licensing as a revenue stream, this raises the concern that even if web archives are transforming content designed for human consumption into machine friendly streams designed for data mining, such transformation may conflict with copyright holders’ own bulk licensing ambitions. For example, many of the large content licensors like LexisNexis, Factiva and Bloomberg all offer licensed commercial bulk feeds designed to support data mining access that pay royalty fees to content owners for their material that is used.

Common Crawl believes it addresses this through the fact that its archive represents only a sample of each website crawled, rather than striving for 100% coverage. Specifically, Ms. Crouse noted that “at present, [crawls are] in monthly increments that are discontinuous month-to-month. We do only what is reasonable, necessary, and economical to achieve a representative sample. For instance, we limit the number of pages crawled from any given domain so, for large content owners, it is highly probable that their content, if included in a certain month’s crawl data, is not wholly represented and thus not ideal for mining for comprehensive results … if the content owner is not a large site, or in a niche market, their URL is less likely to be included in the seeds in the frontier, and, since we limit depth (# of links followed) for the sake of both economy and broader representative web coverage, 'niche' content may not even appear in a given month’s dataset.”

To put it another way, Common Crawl’s mission is to create a “representative sample” of the web at large by crawling a sampling of pages and limiting the number of pages from each site they capture. Thus, their capture of any given site will represent a discontinuous sampling of pages that can change from month to month. A researcher wishing to analyze a single web site in its entirety would therefore not be able to turn to Common Crawl and would instead have to conduct their own crawl of the site or turn to a commercial aggregator that partners with the content holder to license the complete contents of the site.

In Common Crawl’s view this is a critical distinction that sets it apart from both traditional web archiving and the commercial content aggregators that generate data mining revenue for content owners. By focusing on creating a “representative sample” of the web at large, rather than attempting to capture a single site in its entirety (and in fact ensuring that it does not include more than a certain number of pages per site), the crawl limits itself to macro-level research examining web-scale questions. Such “web scale” questions cannot be answered through any existing open dataset, and by incorporating these design features Common Crawl ensures that more traditional research uses, such as data mining the entirety of a single site, which might be viewed as redistributing that site or competing with its owner’s ability to license its content for data mining, are simply not possible.

Thus, to summarize, Common Crawl is both similar to other web archives in its workflow of crawling the web and archiving what it finds, but sets itself apart by focusing on creating a representative sample of HTML pages from across the entire web, rather than trying to preserve the entirety of a specific set of websites with an eye towards visual and functional preservation. Even when a given page is contained in Common Crawl’s archives, the technical sophistication and effort required to extract it and the lack of supporting CSS, JavaScript and image/video files renders the capture useless for the kind of non-technical browser-based access and interaction such pages are designed for.

Of course, copyright and what counts as "fair use" is a notoriously complex, contradictory, contested and ever-changing field and only time will tell whether Common Crawl’s interpretation of fair use holds up and becomes a standard that other web archives follow. At the very least, however, Common Crawl presents a powerful and intriguing model for how web-scale data can power open data research and offers traditional web archives a set of workflows, rationales and precedent to examine that are fully aligned with those of the academic community. Given its popularity and continued growth over the past decade it is clear that Common Crawl’s model is working and that many of its underlying approaches are highly applicable to the broader web archiving community.

Putting this all together, today’s web archives preserve for future generations the dawn of our digital society, but lock those tens of petabytes of documentary holdings away in dark archives or permit only a page at a time to be accessed. Common Crawl’s success and the projects that have been built upon its data stands testament to the incredible possibilities when such archives are unlocked and made available to the research community. Perhaps as the web archiving community modernizes and “open big data” continues to reshape how academic research is conducted, more web archives will follow Common Crawl’s example and explore ways of shaping the future of fair use and gradually opening their doors to research, all while ensuring that copyright and the rights of content holders are respected.

Source: This article was published forbes.com by Kalev Leetaru

Categorized in Online Research