Tutorial 5 - Web Scraping


This tutorial is adapted from the original tutorial Guide to Web Scraping in the Complete Python 3 Bootcamp by Pierian Data.

Content Copyright by Pierian Data

Before we begin with web scraping and Python, here are some important rules to follow and understand:

  1. Always be respectful and try to get permission to scrape. Do not bombard a website with scraping requests, otherwise your IP address may be blocked.

  2. Be aware that websites change often, meaning your code could go from working to totally broken from one day to the next.

  3. Pretty much every web scraping project is a unique and custom job, so try your best to generalize the skills learned here.

Basic Components of a Website

HTML

HTML stands for Hypertext Markup Language, and every website on the Internet uses it to display information. Even the Jupyter Notebook system uses it to display this notebook in your browser. If you right-click on a website and select “View Page Source”, you can see the raw HTML of the web page. This is the information that Python will look at when grabbing data from the page. Let’s take a look at the HTML organization of a simple webpage:

<!DOCTYPE html>
<html>
    <head>
        <title>Title on Browser Tab</title>
    </head>
    <body>
        <h1> Website Header </h1>
        <p> Some Paragraph </p>
    </body>
</html>

Let’s break down these components.

Every <tag> indicates a specific block type on the webpage:

  1. The <!DOCTYPE html> tag is the document type declaration that HTML documents always start with, letting the browser know it is an HTML file.

  2. The component blocks of the HTML document are placed between <html> and </html>.

  3. Metadata and script connections (like a link to a CSS file or a JS file) are often placed in the <head> block.

  4. The <title> tag block defines the title of the webpage (this is what shows up in the tab of a website you’re visiting).

  5. Between <body> and </body> tags are the blocks that will be visible to the site visitor.

  6. Headings are defined by the <h1> through <h6> tags, where the number represents the level of the heading (<h1> is the largest and <h6> the smallest).

  7. Paragraphs are defined by the <p> tag, and this is essentially just normal text on the website.

There are many more tags than these, such as <a> for hyperlinks, <table> for tables, <tr> for table rows, <td> for table cells, and more.

CSS

CSS stands for Cascading Style Sheets, and it is what gives “style” to a website, including colors and fonts, and even some animations. CSS uses the id and class attributes of an HTML element to connect that element to a CSS rule, such as a particular color. The id attribute is a unique identifier for an HTML tag and must be unique within the HTML document. The class attribute defines a general style that can be linked to multiple HTML tags. Basically, if you want to style only a single HTML tag, you would use an id, and if you want to style several HTML tags/blocks the same way, you would create a class in your CSS document and then link it to each of those blocks.
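
As a small illustration (a sketch using the BeautifulSoup library that is introduced later in this tutorial, with made-up element names), an id selects a single element while a class can be shared by several:

import bs4

# A hypothetical HTML fragment: one element with an id, two sharing a class
html_fragment = """
<div id="intro">Welcome</div>
<p class="note">First note</p>
<p class="note">Second note</p>
"""

demo_soup = bs4.BeautifulSoup(html_fragment, "html.parser")
print(demo_soup.select("#intro"))   # the single element with id="intro"
print(demo_soup.select(".note"))    # every element with class="note"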

Web Scraping with Python

Keep in mind again that you should always have permission for the website you are scraping. Check a website’s terms and conditions for more info. Also keep in mind that a computer can send requests to a website very fast, so a website may block your computer’s IP address if you send too many requests too quickly.
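
As a minimal sketch of these good practices (not part of the original tutorial), the standard library can check a site's robots.txt and pause between requests; the URL below is just the example domain used later:

import time
import urllib.robotparser

# Ask the site's robots.txt whether scraping a given URL is allowed
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "http://www.example.com/"))

# Pause between successive requests so the server is not overloaded
time.sleep(1)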

There are a few libraries you will need for this task, which you can install with conda install if you are using the Anaconda distribution.

conda install requests
conda install lxml
conda install bs4

If you are not using the Anaconda distribution, you can use pip install.

pip install requests
pip install lxml
pip install bs4
[1]:
# Install libraries
!pip install requests
!pip install lxml
!pip install bs4
Requirement already satisfied: requests in c:\users\vakanski\anaconda3\lib\site-packages (2.31.0)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\vakanski\anaconda3\lib\site-packages (from requests) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in c:\users\vakanski\anaconda3\lib\site-packages (from requests) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\vakanski\anaconda3\lib\site-packages (from requests) (1.26.16)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\vakanski\anaconda3\lib\site-packages (from requests) (2023.7.22)
Requirement already satisfied: lxml in c:\users\vakanski\anaconda3\lib\site-packages (4.9.2)
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: beautifulsoup4 in c:\users\vakanski\anaconda3\lib\site-packages (from bs4) (4.12.2)
Requirement already satisfied: soupsieve>1.2 in c:\users\vakanski\anaconda3\lib\site-packages (from beautifulsoup4->bs4) (2.4)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py): started
  Building wheel for bs4 (setup.py): finished with status 'done'
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1264 sha256=35474541650220b5e135d4bc9cb76cfa5caba3ae35e2c45fe611f4ecb52838ec
  Stored in directory: c:\users\vakanski\appdata\local\pip\cache\wheels\d4\c8\5b\b5be9c20e5e4503d04a6eac8a3cd5c2393505c29f02bea0960
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1

Example Task 0 - Grabbing the Title of a Page

Let’s start very simple and grab the title of a page. Remember that this is the HTML block with the title tag. For this task, we will use www.example.com which is a website specifically made to serve as an example domain. Let’s go through the main steps:

[2]:
import requests
[3]:
# Step 1: Use the requests library to grab the page
# Note, this may fail if you have a firewall blocking Python/Jupyter
# Note that sometimes you need to run this twice if it fails the first time
res = requests.get("http://www.example.com")
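
If the request does fail, a couple of quick checks on the response can show what happened (a minimal sketch, not in the original notebook):

# 200 means the request succeeded
print(res.status_code)

# Raise an exception if the server returned an error status (4xx or 5xx)
res.raise_for_status()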

The type of the object res is a requests.models.Response object, and it actually contains the information from the website as shown in the following cell.

[4]:
type(res)
[4]:
requests.models.Response
[5]:
res.text
[5]:
'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <h1>Example Domain</h1>\n    <p>This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.</p>\n    <p><a href="https://www.iana.org/domains/example">More information...</a></p>\n</div>\n</body>\n</html>\n'

To analyze the information extracted from the webpage, we use the BeautifulSoup library. Technically, we could write our own custom script to look for items in the string res.text; however, the BeautifulSoup library already has lots of built-in tools and methods to grab information from a string containing an HTML file. Using BeautifulSoup we can create a “soup” object that contains all the “ingredients” of the webpage.

[6]:
import bs4
[7]:
soup = bs4.BeautifulSoup(res.text)
[8]:
soup
[8]:
<!DOCTYPE html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

Now let’s use the .select() method to grab elements. Since we are looking for the 'title' tag, we will pass in 'title'.

[9]:
soup.select('title')
[9]:
[<title>Example Domain</title>]

Notice that the returned object is actually a list containing all the matching titles along with their tags. Since each item is still a specialized Tag object, we can use getText() to grab just the text.

[10]:
title_tag = soup.select('title')
[11]:
title_tag[0]
[11]:
<title>Example Domain</title>
[12]:
type(title_tag[0])
[12]:
bs4.element.Tag
[13]:
title_tag[0].getText()
[13]:
'Example Domain'
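
Putting the steps of this task together, a small helper function could look like the following sketch (the function name is just an example, not part of the original notebook):

def get_page_title(url):
    # Fetch the page and return the text inside its <title> tag
    res = requests.get(url)
    soup = bs4.BeautifulSoup(res.text, "lxml")
    return soup.select('title')[0].getText()

print(get_page_title("http://www.example.com"))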

Example Task 2 - Grabbing All Elements of a Class

Let’s try to grab all the section headings of the Wikipedia Article on Grace Hopper from this URL: https://en.wikipedia.org/wiki/Grace_Hopper

[14]:
# Get the request
res = requests.get('https://en.wikipedia.org/wiki/Grace_Hopper')
[15]:
# Create a soup from request
# The second parameter "lxml" is the type of parser to use.
# lxml is a faster alternative to Python's built-in HTML parser.
soup = bs4.BeautifulSoup(res.text,"lxml")

To figure out what we are actually looking for, let’s inspect the elements on the page.

Syntax to pass to the .select() method      Match Results

soup.select('div')                          All elements with the <div> tag
soup.select('#some_id')                     The HTML element containing the id attribute of some_id
soup.select('.notice')                      All the HTML elements with the CSS class named notice
soup.select('div span')                     Any elements named <span> that are within an element named <div>
soup.select('div > span')                   Any elements named <span> that are directly within an element named <div>, with no other element in between
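
As a quick self-contained illustration of the last two rows (a sketch with a made-up HTML fragment, not from the Wikipedia page):

# One <span> nested deeper inside the <div>, one as a direct child
demo = bs4.BeautifulSoup(
    "<div><p><span>nested</span></p><span>direct</span></div>", "lxml")
print(demo.select('div span'))    # both <span> elements (any depth inside <div>)
print(demo.select('div > span'))  # only the <span> that is a direct child of <div>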

The section headers have the class "mw-headline". Because this is a class and not a tag, we need to use the CSS selector syntax for classes and prefix it with a dot.

[17]:
# note that Wikipedia's HTML can change over time (or vary by region),
# so this class may be called something different
soup.select(".mw-headline")
[17]:
[<span class="mw-headline" id="Early_life_and_education">Early life and education</span>,
 <span class="mw-headline" id="Career">Career</span>,
 <span class="mw-headline" id="World_War_II">World War II</span>,
 <span class="mw-headline" id="UNIVAC">UNIVAC</span>,
 <span class="mw-headline" id="COBOL">COBOL</span>,
 <span class="mw-headline" id="Standards">Standards</span>,
 <span class="mw-headline" id="Retirement">Retirement</span>,
 <span class="mw-headline" id="Post-retirement">Post-retirement</span>,
 <span class="mw-headline" id="Anecdotes">Anecdotes</span>,
 <span class="mw-headline" id="Death">Death</span>,
 <span class="mw-headline" id="Dates_of_rank">Dates of rank</span>,
 <span class="mw-headline" id="Awards_and_honors">Awards and honors</span>,
 <span class="mw-headline" id="Military_awards">Military awards</span>,
 <span class="mw-headline" id="Other_awards">Other awards</span>,
 <span class="mw-headline" id="Legacy">Legacy</span>,
 <span class="mw-headline" id="Places">Places</span>,
 <span class="mw-headline" id="Programs">Programs</span>,
 <span class="mw-headline" id="In_popular_culture">In popular culture</span>,
 <span class="mw-headline" id="Grace_Hopper_Celebration_of_Women_in_Computing">Grace Hopper Celebration of Women in Computing</span>,
 <span class="mw-headline" id="See_also">See also</span>,
 <span class="mw-headline" id="Notes">Notes</span>,
 <span class="mw-headline" id="References">References</span>,
 <span class="mw-headline" id="Obituary_notices">Obituary notices</span>,
 <span class="mw-headline" id="Further_reading">Further reading</span>,
 <span class="mw-headline" id="External_links">External links</span>]
[18]:
for item in soup.select(".mw-headline"):
    print(item.text)
Early life and education
Career
World War II
UNIVAC
COBOL
Standards
Retirement
Post-retirement
Anecdotes
Death
Dates of rank
Awards and honors
Military awards
Other awards
Legacy
Places
Programs
In popular culture
Grace Hopper Celebration of Women in Computing
See also
Notes
References
Obituary notices
Further reading
External links
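
If we want the headings in a Python list rather than printed, a list comprehension gives the same result (a small sketch equivalent to the loop above):

# Collect the text of every section heading into a list
headings = [item.text for item in soup.select(".mw-headline")]
print(len(headings))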

Example Task 3 - Getting an Image from a Website

Let’s attempt to grab the image of the Deep Blue Computer from this Wikipedia article: https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)

[19]:
res = requests.get("https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)")
[20]:
soup = bs4.BeautifulSoup(res.text,'lxml')
[23]:
# Find all <img> tags on the page
img_tags = soup.find_all('img')
[24]:
# Find the image tags whose 'src' attribute contains the 'Deep_Blue' sub-string (the name of the computer)
for img in img_tags:
    # print(img['src'])
    if 'Deep_Blue' in img['src']:
        print(img['src'])
//upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/220px-Deep_Blue.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/8/83/One_of_Deep_Blue%27s_processors_%282586060990%29.jpg/220px-One_of_Deep_Blue%27s_processors_%282586060990%29.jpg

We can actually display it with a markdown cell with the following:

<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/220px-Deep_Blue.jpg'>


Now that we have the actual link, we can grab the image with requests.get(), along with the .content attribute. Note how we had to add the https: scheme in front of the link (the src attribute starts with //, a protocol-relative URL); if you don’t do this, requests will complain (but it gives you a pretty descriptive error message).

[25]:
image_link = requests.get('https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/220px-Deep_Blue.jpg')
[ ]:
# The raw content (it's a binary file, meaning we will need to use binary read/write methods for saving it)
image_link.content

Let’s write the image to a jpg file. Note the 'wb' mode, which denotes writing the file in binary.

[27]:
f = open('my_new_file_name.jpg','wb')
[28]:
f.write(image_link.content)
[28]:
16806
[29]:
f.close()
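
Equivalently, a with block closes the file automatically, even if an error occurs while writing (a minimal alternative sketch):

# The context manager takes care of closing the file
with open('my_new_file_name.jpg', 'wb') as f:
    f.write(image_link.content)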

Now we can display this file right here in the notebook as markdown using:

<img src="image/my_new_file_name.jpg" width="300">

Just write the above line in a new markdown cell and it will display the image we just downloaded.
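
Alternatively, the image can be shown from a code cell with IPython's display tools (a sketch that assumes the file written above exists in the working directory):

from IPython.display import Image

# Display the downloaded image directly in the notebook output
Image(filename='my_new_file_name.jpg', width=300)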


Example Project - Working with Multiple Pages and Items

Let’s show a more realistic example of scraping a full site. The website: http://books.toscrape.com/index.html is specifically designed for people to scrape it. Let’s try to get the title of every book that has a 2 star rating and return a Python list with all their titles.

We will do the following:

  1. Figure out the URL structure to go through every page

  2. Scrape every page in the catalog

  3. Figure out what tag/class represents the Star rating

  4. Filter by that star rating using an if statement

  5. Store the results in a list

We can see that the URL structure is the following:

http://books.toscrape.com/catalogue/page-1.html
[30]:
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'

We can then fill in the page number with .format().
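
For instance, the {} placeholder is replaced by whatever page number we pass in (a quick check, not in the original notebook):

print(base_url.format(20))   # http://books.toscrape.com/catalogue/page-20.html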

[31]:
res = requests.get(base_url.format('1'))

Now let’s grab the products (books) from the get request result.

[32]:
soup = bs4.BeautifulSoup(res.text,"lxml")
[ ]:
soup.select(".product_pod")

Now we can see that each book has the product_pod class. We can select any tag with this class, and then further reduce it by its rating.

[33]:
products = soup.select(".product_pod")
[34]:
example = products[0]
[35]:
type(example)
[35]:
bs4.element.Tag
[36]:
example.attrs
[36]:
{'class': ['product_pod']}

By inspecting the webpage we can see that the class we want is class='star-rating Two'. In CSS selector syntax each class name is prefixed with a dot (if you click on the element in your browser's inspector, you'll notice the space displayed as a dot), so an element with both the star-rating and Two classes is matched by ".star-rating.Two".

[37]:
list(example.children)
[37]:
['\n',
 <div class="image_container">
 <a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>,
 '\n',
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 '\n',
 <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>,
 '\n',
 <div class="product_price">
 <p class="price_color">£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>

         In stock

 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>,
 '\n']
[38]:
example.select('.star-rating.Three')
[38]:
[<p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>]

But we are looking for 2 stars, so it looks like we can just check to see if something was returned.

[39]:
example.select('.star-rating.Two')
[39]:
[]

Alternatively, we can just quickly check the text string to see if "star-rating Two" is in it. Either approach is fine (there are also many other alternative approaches).
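
For example, that string check could look like this (a quick sketch; it relies on the class attribute appearing verbatim in the tag's HTML):

# True only if the raw HTML of this product block mentions the 2-star class
# (False for this 3-star example)
print('star-rating Two' in str(example))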

Now let’s see how we can get the title if we have a 2-star match:

[40]:
example.select('a')
[40]:
[<a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>,
 <a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>]
[41]:
example.select('a')[1]
[41]:
<a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>
[42]:
example.select('a')[1]['title']
[42]:
'A Light in the Attic'

Be aware that a firewall may prevent this script from running. Also, if you are getting a no-response error, try adding a sleep step with time.sleep(1) between requests.

[43]:
two_star_titles = []

for n in range(1,51):

    scrape_url = base_url.format(n)
    res = requests.get(scrape_url)

    soup = bs4.BeautifulSoup(res.text,"lxml")
    books = soup.select(".product_pod")

    for book in books:
        if len(book.select('.star-rating.Two')) != 0:
            two_star_titles.append(book.select('a')[1]['title'])
[44]:
two_star_titles
[44]:
['Starving Hearts (Triangular Trade Trilogy, #1)',
 'Libertarianism for Beginners',
 "It's Only the Himalayas",
 'How Music Works',
 'Maude (1883-1993):She Grew Up with the country',
 "You can't bury them all: Poems",
 'Reasons to Stay Alive',
 'Without Borders (Wanderlove #1)',
 'Soul Reader',
 'Security',
 'Saga, Volume 5 (Saga (Collected Editions) #5)',
 'Reskilling America: Learning to Labor in the Twenty-First Century',
 'Political Suicide: Missteps, Peccadilloes, Bad Calls, Backroom Hijinx, Sordid Pasts, Rotten Breaks, and Just Plain Dumb Mistakes in the Annals of American Politics',
 'Obsidian (Lux #1)',
 'My Paris Kitchen: Recipes and Stories',
 'Masks and Shadows',
 'Lumberjanes, Vol. 2: Friendship to the Max (Lumberjanes #5-8)',
 'Lumberjanes Vol. 3: A Terrible Plan (Lumberjanes #9-12)',
 'Judo: Seven Steps to Black Belt (an Introductory Guide for Beginners)',
 'I Hate Fairyland, Vol. 1: Madly Ever After (I Hate Fairyland (Compilations) #1-5)',
 'Giant Days, Vol. 2 (Giant Days #5-8)',
 'Everydata: The Misinformation Hidden in the Little Data You Consume Every Day',
 "Don't Be a Jerk: And Other Practical Advice from Dogen, Japan's Greatest Zen Master",
 'Bossypants',
 'Bitch Planet, Vol. 1: Extraordinary Machine (Bitch Planet (Collected Editions))',
 'Avatar: The Last Airbender: Smoke and Shadow, Part 3 (Smoke and Shadow #3)',
 'Tuesday Nights in 1980',
 'The Psychopath Test: A Journey Through the Madness Industry',
 'The Power of Now: A Guide to Spiritual Enlightenment',
 "The Omnivore's Dilemma: A Natural History of Four Meals",
 'The Love and Lemons Cookbook: An Apple-to-Zucchini Celebration of Impromptu Cooking',
 'The Girl on the Train',
 'The Emerald Mystery',
 'The Argonauts',
 'Suddenly in Love (Lake Haven #1)',
 'Soft Apocalypse',
 "So You've Been Publicly Shamed",
 'Shoe Dog: A Memoir by the Creator of NIKE',
 'Louisa: The Extraordinary Life of Mrs. Adams',
 'Large Print Heart of the Pride',
 'Grumbles',
 'Chasing Heaven: What Dying Taught Me About Living',
 'Becoming Wise: An Inquiry into the Mystery and Art of Living',
 'Beauty Restored (Riley Family Legacy Novellas #3)',
 'Batman: The Long Halloween (Batman)',
 "Ayumi's Violin",
 'Wild Swans',
 "What's It Like in Space?: Stories from Astronauts Who've Been There",
 'Until Friday Night (The Field Party #1)',
 'Unbroken: A World War II Story of Survival, Resilience, and Redemption',
 'Twenty Yawns',
 'Through the Woods',
 'This Is Where It Ends',
 'The Year of Magical Thinking',
 'The Last Mile (Amos Decker #2)',
 'The Immortal Life of Henrietta Lacks',
 'The Hidden Oracle (The Trials of Apollo #1)',
 'The Guilty (Will Robie #4)',
 'Red Hood/Arsenal, Vol. 1: Open for Business (Red Hood/Arsenal #1)',
 'Once Was a Time',
 'No Dream Is Too High: Life Lessons From a Man Who Walked on the Moon',
 'Naruto (3-in-1 Edition), Vol. 14: Includes Vols. 40, 41 & 42 (Naruto: Omnibus #14)',
 'More Than Music (Chasing the Dream #1)',
 'Lowriders to the Center of the Earth (Lowriders in Space #2)',
 'Eat Fat, Get Thin',
 'Doctor Sleep (The Shining #2)',
 'Crazy Love: Overwhelmed by a Relentless God',
 'Carrie',
 'Batman: Europa',
 'Angels Walking (Angels Walking #1)',
 'Adulthood Is a Myth: A "Sarah\'s Scribbles" Collection',
 'A Study in Scarlet (Sherlock Holmes #1)',
 'A Series of Catastrophes and Miracles: A True Story of Love, Science, and Cancer',
 "A People's History of the United States",
 'My Kitchen Year: 136 Recipes That Saved My Life',
 'The Lonely City: Adventures in the Art of Being Alone',
 'The Dinner Party',
 'Stars Above (The Lunar Chronicles #4.5)',
 'Love, Lies and Spies',
 'Troublemaker: Surviving Hollywood and Scientology',
 'The Widow',
 'Setting the World on Fire: The Brief, Astonishing Life of St. Catherine of Siena',
 'Mothering Sunday',
 'Lilac Girls',
 '10% Happier: How I Tamed the Voice in My Head, Reduced Stress Without Losing My Edge, and Found Self-Help That Actually Works',
 'Underlying Notes',
 'The Flowers Lied',
 'Modern Day Fables',
 "Chernobyl 01:23:40: The Incredible True Story of the World's Worst Nuclear Disaster",
 '23 Degrees South: A Tropical Tale of Changing Whether...',
 'When Breath Becomes Air',
 'Vagabonding: An Uncommon Guide to the Art of Long-Term World Travel',
 'The Martian (The Martian #1)',
 "Miller's Valley",
 "Love That Boy: What Two Presidents, Eight Road Trips, and My Son Taught Me About a Parent's Expectations",
 'Left Behind (Left Behind #1)',
 'Howl and Other Poems',
 "Heaven is for Real: A Little Boy's Astounding Story of His Trip to Heaven and Back",
 "Brazen: The Courage to Find the You That's Been Hiding",
 '32 Yolks',
 'Wildlife of New York: A Five-Borough Coloring Book',
 'Unreasonable Hope: Finding Faith in the God Who Brings Purpose to Your Pain',
 'The Art Book',
 'Steal Like an Artist: 10 Things Nobody Told You About Being Creative',
 'Raymie Nightingale',
 'Like Never Before (Walker Family #2)',
 'How to Be a Domestic Goddess: Baking and the Art of Comfort Cooking',
 'Finding God in the Ruins: How God Redeems Pain',
 'Chronicles, Vol. 1',
 'A Summer In Europe',
 'The Rise and Fall of the Third Reich: A History of Nazi Germany',
 'The Makings of a Fatherless Child',
 'The Fellowship of the Ring (The Lord of the Rings #1)',
 "Tell the Wolves I'm Home",
 'In the Woods (Dublin Murder Squad #1)',
 'Give It Back',
 'Why Save the Bankers?: And Other Essays on Our Economic and Political Crisis',
 'The Raven King (The Raven Cycle #4)',
 'The Expatriates',
 'The 5th Wave (The 5th Wave #1)',
 'Peak: Secrets from the New Science of Expertise',
 'Logan Kade (Fallen Crest High #5.5)',
 "I Know Why the Caged Bird Sings (Maya Angelou's Autobiography #1)",
 'Drama',
 "America's War for the Greater Middle East: A Military History",
 'A Game of Thrones (A Song of Ice and Fire #1)',
 "The Pilgrim's Progress",
 'The Hound of the Baskervilles (Sherlock Holmes #5)',
 "The Geography of Bliss: One Grump's Search for the Happiest Places in the World",
 'The Demonists (Demonist #1)',
 'The Demon Prince of Momochi House, Vol. 4 (The Demon Prince of Momochi House #4)',
 'Misery',
 'Far From True (Promise Falls Trilogy #2)',
 'Confessions of a Shopaholic (Shopaholic #1)',
 'Vegan Vegetarian Omnivore: Dinner for Everyone at the Table',
 'Two Boys Kissing',
 'Twilight (Twilight #1)',
 'Twenties Girl',
 'The Tipping Point: How Little Things Can Make a Big Difference',
 'The Stand',
 'The Picture of Dorian Gray',
 'The Name of God is Mercy',
 "The Lover's Dictionary",
 'The Last Painting of Sara de Vos',
 'The Guns of August',
 'The Girl Who Played with Fire (Millennium Trilogy #2)',
 'The Da Vinci Code (Robert Langdon #2)',
 'The Cat in the Hat (Beginner Books B-1)',
 'The Book Thief',
 'The Autobiography of Malcolm X',
 "Surely You're Joking, Mr. Feynman!: Adventures of a Curious Character",
 'Soldier (Talon #3)',
 'Shopaholic & Baby (Shopaholic #5)',
 'Seven Days in the Art World',
 'Rework',
 'Packing for Mars: The Curious Science of Life in the Void',
 'Orange Is the New Black',
 'One for the Money (Stephanie Plum #1)',
 'Midnight Riot (Peter Grant/ Rivers of London - books #1)',
 'Me Talk Pretty One Day',
 'Manuscript Found in Accra',
 'Lust & Wonder',
 "Life, the Universe and Everything (Hitchhiker's Guide to the Galaxy #3)",
 'Life After Life',
 'I Am Malala: The Girl Who Stood Up for Education and Was Shot by the Taliban',
 'House of Lost Worlds: Dinosaurs, Dynasties, and the Story of Life on Earth',
 'Horrible Bear!',
 'Holidays on Ice',
 'Girl in the Blue Coat',
 'Fruits Basket, Vol. 3 (Fruits Basket #3)',
 'Cosmos',
 'Civilization and Its Discontents',
 "Catastrophic Happiness: Finding Joy in Childhood's Messy Years",
 'Career of Evil (Cormoran Strike #3)',
 'Born to Run: A Hidden Tribe, Superathletes, and the Greatest Race the World Has Never Seen',
 "Best of My Love (Fool's Gold #20)",
 'Beowulf',
 'Awkward',
 'And Then There Were None',
 'A Storm of Swords (A Song of Ice and Fire #3)',
 'The Suffragettes (Little Black Classics, #96)',
 'Vampire Girl (Vampire Girl #1)',
 'Three Wishes (River of Time: California #1)',
 'The Wicked + The Divine, Vol. 1: The Faust Act (The Wicked + The Divine)',
 'The Little Prince',
 'The Last Girl (The Dominion Trilogy #1)',
 'Taking Shots (Assassins #1)',
 'Settling the Score (The Summer Games #1)',
 'Rhythm, Chord & Malykhin',
 'One Second (Seven #7)',
 "Old Records Never Die: One Man's Quest for His Vinyl and His Past",
 'Of Mice and Men',
 'My Perfect Mistake (Over the Top #1)',
 'Meditations',
 'Frankenstein',
 'Emma']

Now you should have the tools necessary to scrape any websites that interest you.

Keep in mind that the more complex the website, the harder it will be to scrape.

And always ask for permission!
