Thursday, 07 May 2020 13:10

How To Scrape the Dark Web

Author:  [Source: This article was published in towardsdatascience.com By Mitchell Telatnik]

Scraping the Dark Web using Python, Selenium, and TOR on Mac OSX

Warning: Accessing the dark web can be dangerous! Please continue at your own risk and take necessary security precautions such as disabling scripts and using a VPN service.

Introduction

Finding Hidden Services

Method 1: Directories

Method 2: Snowball Sampling
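
Snowball sampling, in general terms, starts from a few known addresses and follows the links found on each page to discover new ones. The following is a minimal sketch of that idea using only the standard library. The names `extract_onion_links`, `snowball`, and the `fetch_html` callback are hypothetical and do not come from the original article, and the regular expression assumes onion hostnames are 16 (v2) or 56 (v3) base32 characters long.

```python
import re
from collections import deque

# Assumption: onion hostnames are 56 (v3) or 16 (v2) base32 characters.
ONION_RE = re.compile(r"\b((?:[a-z2-7]{56}|[a-z2-7]{16})\.onion)\b")


def extract_onion_links(html):
    """Return the distinct .onion hostnames in a page, in order of appearance."""
    found = []
    for host in ONION_RE.findall(html.lower()):
        if host not in found:
            found.append(host)
    return found


def snowball(seeds, fetch_html, max_sites=100):
    """Breadth-first discovery: visit each host once and queue the hosts it links to.

    fetch_html is a caller-supplied function (hypothetical here) that takes a
    hostname and returns that page's HTML as a string.
    """
    queue = deque(seeds)
    visited = []
    while queue and len(visited) < max_sites:
        host = queue.popleft()
        if host in visited:
            continue
        visited.append(host)
        for linked in extract_onion_links(fetch_html(host)):
            if linked not in visited and linked not in queue:
                queue.append(linked)
    return visited
```

In practice `fetch_html` could wrap `driver.get(...)` followed by `driver.page_source`; the `max_sites` cap keeps the crawl from growing without bound.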

Environment Setup

TOR Browser

VPN

Python

Pandas

pip install pandas

Selenium

pip install selenium

Geckodriver

Firefox Binary

Implementation

from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import pandas as pd

# Placeholder: replace with the path to your Firefox binary
binary = FirefoxBinary('*path to your firefox binary*')
driver = webdriver.Firefox(firefox_binary=binary)

# Placeholder: replace with the address of the page to scrape
url = '*your url*'
driver.get(url)

Basic Selenium Scraping Techniques

Finding Elements

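# Find the first element with a given class name
driver.find_element_by_class_name("postMain")

# Find a single element by its XPath
driver.find_element_by_xpath('/html/body/div/div[2]/div[2]/div/div[1]/div/a[1]')

# Find every element with a given class name (returns a list)
driver.find_elements_by_class_name("postMain")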
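
The setup and lookup calls above use the Selenium 3 API. In Selenium 4 the `firefox_binary` argument and the `find_element_by_*` helpers were deprecated and later removed, so a rough equivalent for newer installs might look like the sketch below. The function names `make_driver` and `find_posts` are hypothetical, and the binary path is still something you supply yourself.

```python
def make_driver(firefox_binary_path):
    """Start Firefox from an explicit binary path using the Selenium 4 options API."""
    # Imported lazily so this file can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.binary_location = firefox_binary_path  # path supplied by the caller
    return webdriver.Firefox(options=options)


def find_posts(driver, class_name="postMain"):
    """Return all elements with the given class name (Selenium 4 locator style)."""
    from selenium.webdriver.common.by import By

    return driver.find_elements(By.CLASS_NAME, class_name)
```

XPath lookups follow the same pattern with `By.XPATH` in place of `By.CLASS_NAME`.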

Getting the Text of an Element

driver.find_element_by_class_name('postContent').text

Storing Elements

post_content_list = []
postText = driver.find_element_by_class_name('postContent').text
post_content_list.append(postText)
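
When a page holds many posts, the same append pattern can be applied to every element returned by a `find_elements_*` call. A minimal sketch, assuming each element exposes a `.text` attribute as Selenium elements do; `collect_texts` and `FakeElement` are hypothetical names, and `FakeElement` exists only so the snippet runs without a browser.

```python
class FakeElement:
    """Stand-in for a Selenium WebElement so the sketch runs without a browser."""

    def __init__(self, text):
        self.text = text


def collect_texts(elements):
    """Return the stripped, non-empty text of each element."""
    texts = []
    for element in elements:
        text = element.text.strip()
        if text:
            texts.append(text)
    return texts


post_content_list = collect_texts(
    [FakeElement(" first post "), FakeElement(""), FakeElement("second post")]
)
print(post_content_list)  # ['first post', 'second post']
```

With a live driver, the list passed in would be the result of `driver.find_elements_by_class_name('postContent')`.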

Crawling Between Pages

# MAX_PAGE_NUM is the number of the last page to crawl; set it before running the loop
for i in range(1, MAX_PAGE_NUM + 1):
    page_num = i
    url = '*first part of url*' + str(page_num) + '*last part of url*'
    driver.get(url)
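
The loop above can be split into two steps: building the list of page URLs, then visiting each one. The sketch below does that and adds a pause between requests. The names `build_page_urls`, `crawl_pages`, `scrape_page`, and the two-second default delay are assumptions for illustration, not part of the original article.

```python
import time


def build_page_urls(prefix, suffix, max_page_num):
    """Return the URLs for pages 1..max_page_num, numbered like the loop above."""
    return [f"{prefix}{page_num}{suffix}" for page_num in range(1, max_page_num + 1)]


def crawl_pages(driver, urls, scrape_page, delay_seconds=2.0):
    """Load each URL, run the caller-supplied scrape_page(driver), collect results."""
    results = []
    for url in urls:
        driver.get(url)
        results.append(scrape_page(driver))
        time.sleep(delay_seconds)  # pause between requests to reduce load on the site
    return results
```

Keeping URL construction separate makes it easy to check the generated addresses before any request is sent.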

Exporting to CSV File

# Build a DataFrame with one column per scraped list, then write it to disk
df = pd.DataFrame()
df['postURL'] = post_url_list
df['author'] = post_author_list
df['postTitle'] = post_title_list
df.to_csv('scrape.csv')
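
Assigning lists of different lengths as columns of the same DataFrame raises a `ValueError`, which can happen when one field is missing on some posts. A small check before building the DataFrame makes the mismatch easier to diagnose. This is a sketch only; `check_equal_lengths` is a hypothetical helper.

```python
def check_equal_lengths(columns):
    """Return the shared row count of all columns, or raise if the lengths differ.

    columns maps a column name to the list of values scraped for it.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)
```

For the lists above, the call would be `check_equal_lengths({'postURL': post_url_list, 'author': post_author_list, 'postTitle': post_title_list})`.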

Anti-crawling Measures

[Image: captcha.png]

# Wait up to 10000 seconds for the element to appear, leaving time to solve a CAPTCHA by hand
driver.implicitly_wait(10000)
driver.find_element_by_class_name("postMain")
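
An alternative to one very long implicit wait is to poll for the expected element with an explicit deadline, so the script can stop cleanly if the CAPTCHA is never solved. The sketch below is a generic polling helper; `wait_until` and its default timings are assumptions, not part of the original article.

```python
import time


def wait_until(condition, timeout_seconds=600.0, poll_seconds=1.0):
    """Call condition() repeatedly until it returns True or the timeout passes.

    Returns True if the condition was met, False if the deadline was reached.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

A possible use is `wait_until(lambda: len(driver.find_elements_by_class_name("postMain")) > 0)`. This assumes the implicit wait is kept short, since a long implicit wait would make each `find_elements` call block on its own.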

import pandas as pd

df = pd.read_csv('scrape.csv')
df2 = pd.read_csv('scrape2.csv')
df3 = pd.read_csv('scrape3.csv')
df4 = pd.read_csv('scrape4.csv')
df5 = pd.read_csv('scrape5.csv')
df6 = pd.read_csv('scrape6.csv')
frames = [df, df2, df3, df4, df5, df6]
result = pd.concat(frames, ignore_index=True)
result.to_csv('ForumScrape.csv')
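
When the number of partial scrape files varies, the same merge can be written as a loop over file names. This is a minimal sketch under stated assumptions: `scrape_file_names` and `merge_scrapes` are hypothetical helpers, the naming scheme follows the files above, and `index=False` is a choice made here to avoid writing an extra index column.

```python
def scrape_file_names(count, stem="scrape"):
    """Return names like scrape.csv, scrape2.csv, ... matching the files above."""
    return [f"{stem}.csv" if n == 1 else f"{stem}{n}.csv" for n in range(1, count + 1)]


def merge_scrapes(paths, output_path):
    """Read each CSV, stack the rows, and write the combined file."""
    # Imported lazily so the name helper can be used without pandas installed.
    import pandas as pd

    frames = [pd.read_csv(path) for path in paths]
    result = pd.concat(frames, ignore_index=True)
    result.to_csv(output_path, index=False)  # assumption: no index column wanted
    return result
```

For the six files above, the call would be `merge_scrapes(scrape_file_names(6), 'ForumScrape.csv')`.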

Discussion

[Source: This article was published in towardsdatascience.com By Mitchell Telatnik - Uploaded by the Association Member: Deborah Tannen]
