web_scraping

Web Scraping in Python

In this appendix lecture we'll go over how to scrape information from the web using Python. We'll go to a website, decide what information we want, see where and how it is stored, then scrape it and load it into a pandas DataFrame.

Some things you should consider before scraping a website:

1.) Check a site's terms and conditions before you scrape it.
2.) Space out your requests so you don't overload the site's server; overloading it could get you blocked.
3.) Scrapers break over time. Web pages change their layout often, so you'll more than likely have to rewrite your code.
4.) Web pages are usually inconsistent, so you'll more than likely have to clean up the data after scraping it.
5.) Every web page and situation is different, so you'll have to spend time configuring your scraper.

To learn more about HTML I suggest these two resources: W3Schools and Codecademy.

In addition to Python, we'll need three modules:

1.) BeautifulSoup, which you can install by typing pip install beautifulsoup4 or conda install beautifulsoup4 (for the Anaconda distribution of Python) in your command prompt.
2.) lxml, which you can install by typing pip install lxml or conda install lxml (for the Anaconda distribution of Python) in your command prompt.
3.) requests, which you can install by typing pip install requests or conda install requests (for the Anaconda distribution of Python) in your command prompt.

We'll start with our imports:
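The notebook's own import cell is not reproduced on this page, so the following is only a minimal sketch of what the imports and a first scraping step might look like. It assumes the three modules listed above plus pandas are installed. The helper names fetch_html and table_to_dataframe, the one-second delay, and the sample table are all hypothetical and chosen for illustration; they are not taken from the original notebook.

```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup


def fetch_html(url, delay_seconds=1.0):
    """Download a page, pausing first so repeated calls are spaced out."""
    time.sleep(delay_seconds)  # assumed delay; tune it to the site you scrape
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def table_to_dataframe(html, parser="lxml"):
    """Parse the first <table> in the HTML into a pandas DataFrame."""
    soup = BeautifulSoup(html, parser)
    rows = soup.find("table").find_all("tr")
    # The first row is assumed to hold the column headers in <th> cells.
    header = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
    records = [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in rows[1:]
    ]
    return pd.DataFrame(records, columns=header)


# Hypothetical sample markup, used so the sketch runs without a network call.
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Amount</th></tr>
  <tr><td>2013</td><td>10</td></tr>
  <tr><td>2014</td><td>12</td></tr>
</table>
"""

df = table_to_dataframe(SAMPLE_HTML)
print(df)
```

For a real page you would pass the result of fetch_html(url) to table_to_dataframe instead of the sample string. Every cell comes back as text, so numeric columns still need converting (for example with pd.to_numeric), which is part of the cleanup mentioned in point 4 above.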

The Unlicense

Created on Feb 17, 2021

Updated on Nov 18, 2021

Stars: 2
Forks: 0
Watchers: 2
Open Issues: 0

Repository Health Score: 40/100 (Poor)

Overall repository health assessment

Score Breakdown

Activity: 0/30 (0%). Inactive, no updates in 3+ months.

Recent Commits

ce9b79f  Add files via upload (venugopalpg96, 5 years ago)
bd14668  Initial commit (venugopalpg96, 5 years ago)

Community: 0/30 (0%). 2 stars, 0 forks.

Documentation: 20/20 (100%). Has description, wiki, license.

Maintenance: 20/20 (100%). 0.0% issue ratio.

Health score is calculated based on activity, community engagement, documentation quality, and maintenance practices

Languages

Jupyter Notebook: 100.0%

Top Contributors

1. venugopalpg96 (User): 2 commits
