Any document type that the publishing-API knows about can be added to our internal search. By default, all document types in internal search also get included in the GOV.UK sitemap, which tells external search engines about our content.

The app responsible for search is Rummager. Rummager listens to RabbitMQ messages about published documents to know when to index documents. For the new document type to be indexed, you need to add it to a whitelist.

Rummager has its own concept of document type, which represents the schema used to store documents in Elasticsearch (the search engine).

Normally, you’ll map your document type to an existing rummager document type. If in doubt, use “edition”, which is used for most documents.

Then, modify mapped_document_types.yml with the mapping from the publishing API document type.

If you want a search to be able to use metadata that isn’t defined in any rummager document type, then you’ll need to add new fields to rummager.

Rummager knows how to handle most of the core fields from the publishing platform, like title, description, and public_updated_at. It looks at the body or parts fields to work out what text to make searchable. If your schema uses different fields to render the text of the page, update the IndexableContentPresenter as well.

The part of rummager that translates between publishing API fields and search fields is elasticsearch_presenter.rb. Modify this if there is anything special you want a search to do with your documents (for example, appending additional information to the title).

2. Add the document type to migrated_formats.yaml

Add the document_type name to the migrated list in rummager.

3. Reindex

Reindex the govuk index following the instructions in “Reindex an Elasticsearch index”.

4. Republish all the documents

Republish all the documents. If they have been published already, you can republish them with the publishing-api represent_downstream rake task:

rake represent_downstream:document_type[new_document_type]

You can test that the documents appear in search through the API using a query such as:
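A minimal sketch of building such a query in Python. The endpoint https://www.gov.uk/api/search.json and the filter_content_store_document_type parameter are assumptions rather than details taken from this page, and search_url is a made-up helper name:

```python
from urllib.parse import urlencode

# Assumption: the public GOV.UK search API endpoint.
SEARCH_API = "https://www.gov.uk/api/search.json"


def search_url(document_type: str) -> str:
    """Build a search API URL that filters results to one document type."""
    # Assumption: the filter field is named after the publishing API
    # document type; check the search API documentation before relying on it.
    params = {"filter_content_store_document_type": document_type}
    return f"{SEARCH_API}?{urlencode(params)}"


print(search_url("new_document_type"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should list the republished documents if they were indexed.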

Source: This article was published on docs.publishing.service.gov.uk

Hoping to scour through public records and expose corruption, crime or wrongdoing? The Investigative Dashboard might be your best bet.

Developed by the Organized Crime and Corruption Reporting Project (OCCRP), Investigative Dashboard contains a number of tools and resources meant to make it easier for journalists and civil society researchers to investigate and expose corrupt individuals and businesses. Its investigative tools include databases, visualization tools and a search engine.

Journalists can also access the dashboard’s catalog of external databases, which links to more than 400 online databases in 120 countries and jurisdictions — from Afghanistan to Zimbabwe.

We spoke with developer Friedrich Lindenberg about getting started with the dashboard and using these tools to their full potential:

Searching for leads

Investigative Dashboard’s database houses more than 4 million documents, data sources and more that are sorted into 141 collections.

Journalists can use a custom-built search tool, Aleph — which Lindenberg built himself — to search the database either by specific terms related to their investigation or by category.

“Often as a journalist, you want to find out ‘Where can I find information about this person or this company?’” Lindenberg said. “What you want then is a place where you can search as many data sources as possible. That's why we're bringing together a lot of government data, corporate records and other kinds of information from previous investigations that we have exclusive access to, and all of that is searchable.”

Additionally, journalists can get email alerts for their chosen search terms so they’ll always be notified of new developments regarding the individuals or companies they’re investigating.

“As we get more and more data, what we can quite easily do is run your list of people that you're interested in against all these sources and see if there's new leads popping up,” Lindenberg said. “One of the things we're trying to do with Aleph is create incentives for people to write down names of people they’ve investigated previously or would like to know more about. Then we will continuously send you a feed of stuff that we dig up.”

Mapping and visualizing

Investigative Dashboard links to Visual Investigative Scenarios (VIS), a free data visualization platform built to show networks of business, corruption or crime, turning complex narratives into easy-to-understand visual depictions.

Journalists can input entities like people, companies, political parties or criminal organizations, then draw connections between them and attach documents as evidence. Once a visualization is complete, it can be exported for online, print or broadcast use.

Research support

Journalists can also directly ask OCCRP researchers to help them investigate companies or individuals of interest. OCCRP has access to certain commercial databases that may be prohibitively expensive for some journalists to use. While users can’t access these commercial databases via the Dashboard, OCCRP researchers are there to lend a helping hand, Lindenberg explained.

“One of the cool parts of this is that basically, as OCCRP, we've purchased subscriptions to some commercial databases that are otherwise inaccessible to journalists,” he said. “We can't give everybody access to them because then we'd break the terms of service, but what we can do is have our researchers look up the things you might be interested and then give you back the documents they find there.”

Once users with a Dashboard account submit a ticket describing the person or entity they’re investigating, OCCRP researchers will search these databases to see what, if anything, comes up.

Uploading files

OCCRP encourages journalists to upload their own documents and data using its personal archive tool. After creating an account, journalists can upload documents, create watchlists and organize their research. By default, all uploaded documents are private, but users can share their documents with others or make them public if they choose.

To make sure no false data or documents are uploaded to Aleph, OCCRP bots periodically crawl through public documents to verify and cross-reference them, Lindenberg explained.

Author : Sam Berkhead

Source : http://ijnet.org/en/blog/journalists-investigating-corruption-free-tool-offers-millions-searchable-documents

I know some people may wonder why I am explaining this. I have run across several people who were unaware of how easy it is to extract images from Word. This industry is involved with a lot of guest blogging, so I thought I would help people save some time.

Extracting images from Word is very quick and easy, and here is how to do it.

Start With an Open Document

Once you have a document open, save it as a “Web Page, Filtered”.

A Folder is Created

When you have saved a document as a “Web Page, Filtered”, a folder is created in the location where you chose to save the “Web Page”. This folder will hold all the images in the document. (Note: the folder name is the same as the title of the document.)

Inside This Folder

The images in the original document will be in this folder, but the document itself will not, so make sure you save your document and your “Web Page, Filtered” in the same place.

The document images are saved automatically and given “image” names. You can always rename them if the author has a preference for the names of images.

Saving a Word document in this manner can help you in another way. Say someone sends you a zipped file with specifically named images for a post, but there are like 10 of them and there is no number order. Organizing a post like this requires a lot of back and forth. If you extract the zipped images to this saved Web Page folder you will have the images listed in order and you can use them as a guide for the named images.
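For anyone comfortable running a script, there is another route to the same images, separate from the “Web Page, Filtered” method above: a .docx file is a ZIP archive, and Word stores embedded images under word/media/ inside it. Here is a minimal Python sketch, assuming a .docx file (not the older .doc format). The function name extract_docx_images and the file names are made up for illustration:

```python
import zipfile
from pathlib import Path


def extract_docx_images(docx_path: str, out_dir: str) -> list:
    """Copy every file stored under word/media/ in a .docx into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Embedded images live under word/media/ inside the archive.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out / Path(name).name
                target.write_bytes(archive.read(name))
                saved.append(target.name)
    # Return the saved file names in alphabetical order.
    return sorted(saved)
```

Calling extract_docx_images("guest-post.docx", "guest-post-images") would copy the images into the guest-post-images folder and return their file names.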

Source : https://www.searchenginejournal.com
