How to Parse URLs in Python: A Comprehensive Guide with Examples
Parsing URLs (Uniform Resource Locators) is a common task in web development and data processing. URLs are the addresses that identify resources on the internet, such as web pages, images, and documents. Python provides powerful libraries and tools to easily parse and manipulate URLs, enabling developers to extract specific components and perform various operations on them.
Introduction to URL Parsing
URL parsing involves breaking down a URL into its constituent parts, such as the scheme (e.g., “http”), host (e.g., “www.example.com”), path (e.g., “/page”), query parameters (e.g., “key=value”), and more. Python provides several libraries to facilitate URL parsing, each with its own set of features and capabilities.
Using the urllib.parse Module
The urllib.parse module is part of the Python standard library, so it needs no installation. It offers functions to split URLs into components (urlparse()), decode query strings (parse_qs()), and percent-encode text (quote() and quote_plus()).
Example: Parsing a Simple URL
>>> from urllib.parse import urlparse
>>>
>>> url = "https://www.example.com/page?query=value"
>>> parsed_url = urlparse(url)
>>> print("Scheme:", parsed_url.scheme)
Scheme: https
>>> print("Netloc:", parsed_url.netloc)
Netloc: www.example.com
>>> print("Path:", parsed_url.path)
Path: /page
>>> print("Query:", parsed_url.query)
Query: query=value
>>>
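Beyond the four attributes shown above, the object returned by urlparse() also exposes the hostname, port, and fragment, and it can be modified and reassembled into a string. The following is a minimal sketch of that; the URL is made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical URL with an explicit port, a query, and a fragment.
url = "https://www.example.com:8443/docs/page?query=value#section-2"
parsed = urlparse(url)

print(parsed.hostname)  # www.example.com
print(parsed.port)      # 8443 (an int, or None when no port is given)
print(parsed.fragment)  # section-2

# The result is a named tuple, so _replace() returns a modified copy
# and leaves the original untouched.
updated = parsed._replace(scheme="http", netloc="www.example.com", fragment="")
print(updated.geturl())  # http://www.example.com/docs/page?query=value
```

Note that netloc still contains the port ("www.example.com:8443"), while hostname strips it, which is usually what you want when comparing hosts.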
Example: Encode Query Parameters with urllib.parse.quote() vs urllib.parse.quote_plus()
The urllib.parse.quote() and urllib.parse.quote_plus() functions from Python’s standard library are used to percent-encode strings, making them safe to use in URLs. Here’s how they differ:
urllib.parse.quote()
- This function percent-encodes a string.
- It replaces special characters in the string using the %xx escape. For example, a space is replaced with %20.
- By default it leaves / unencoded (safe='/'), so it is suited to encoding the path portion of a URL, the part before the ?.
urllib.parse.quote_plus()
- Similar to quote(), but it replaces spaces with + signs, while quote() replaces spaces with %20.
- It replaces / with %2F, while quote() leaves / unencoded by default.
- This behavior is useful when encoding form data in the query component of a URL, where spaces are conventionally encoded as +, as in HTML form submissions.
urllib.parse.quote() vs urllib.parse.quote_plus() examples:
>>> import urllib.parse
>>> urllib.parse.quote("https://example.com/path?foo=bar+abc&a=ABC%20XYZ 123")
'https%3A//example.com/path%3Ffoo%3Dbar%2Babc%26a%3DABC%2520XYZ%20123'
# `safe=''` also replaces `/` with `%2F`
>>> urllib.parse.quote("https://example.com/path?foo=bar+abc&a=ABC%20XYZ 123", safe='')
'https%3A%2F%2Fexample.com%2Fpath%3Ffoo%3Dbar%2Babc%26a%3DABC%2520XYZ%20123'
>>> urllib.parse.quote_plus("https://example.com/path?foo=bar+abc&a=ABC%20XYZ 123")
'https%3A%2F%2Fexample.com%2Fpath%3Ffoo%3Dbar%2Babc%26a%3DABC%2520XYZ+123'
>>> url = "https://example.com/path?foo=bar+abc&a=ABC%20XYZ 123"
>>> url = f"https://example.com?url={urllib.parse.quote_plus(url)}"
>>> url
'https://example.com?url=https%3A%2F%2Fexample.com%2Fpath%3Ffoo%3Dbar%2Babc%26a%3DABC%2520XYZ+123'
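Each encoder has a matching decoder in the standard library, unquote() and unquote_plus(), and urlencode() builds a whole query string from a mapping. The sketch below shows how they pair up; the sample text and parameter names are illustrative only.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

text = "ABC XYZ/123"

encoded_path = quote(text)       # 'ABC%20XYZ/123'
encoded_form = quote_plus(text)  # 'ABC+XYZ%2F123'

# Each decoder reverses its matching encoder.
assert unquote(encoded_path) == text
assert unquote_plus(encoded_form) == text

# unquote() does not treat '+' as a space, so mixing the pairs
# gives back a different string.
print(unquote(encoded_form))     # ABC+XYZ/123

# urlencode() encodes every key and value with quote_plus() by default
# and joins the pairs with '&'.
params = {"q": "python url parsing", "next": "/path?a=1"}
query = urlencode(params)
print(query)                     # q=python+url+parsing&next=%2Fpath%3Fa%3D1
```

When building a query string from several parameters, urlencode() is usually less error-prone than calling quote_plus() on each value and concatenating by hand.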
Example: Extracting Query Parameters
>>> from urllib.parse import parse_qs
>>>
>>> query_string = "key1=value1&key2=value2&key3=value3"
>>> query_params = parse_qs(query_string)
>>>
>>> for key, values in query_params.items():
...     print(key, ":", values)
...
key1 : ['value1']
key2 : ['value2']
key3 : ['value3']
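parse_qs() returns a list for every key because a key may appear more than once in a query string. The sketch below, using a made-up search URL, pulls the query out of a full URL first and also shows parse_qsl(), which returns ordered (key, value) pairs instead of a dictionary.

```python
from urllib.parse import urlparse, parse_qs, parse_qsl

# Hypothetical URL with a repeated key and a '+'-encoded space.
url = "https://www.example.com/search?tag=python&tag=url&page=2&q=a+b"
query = urlparse(url).query

params = parse_qs(query)
print(params)
# {'tag': ['python', 'url'], 'page': ['2'], 'q': ['a b']}

# Values are always strings in a list; take the first one and convert it,
# falling back to a default when the key is missing.
page = int(params.get("page", ["1"])[0])
print(page)  # 2

pairs = parse_qsl(query)
print(pairs)
# [('tag', 'python'), ('tag', 'url'), ('page', '2'), ('q', 'a b')]
```

Both functions decode percent-escapes and turn + into a space, so the values come back ready to use.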
Advanced URL Parsing with the furl Library
The furl library is a third-party package that offers a user-friendly API for parsing and manipulating URLs, including resolving relative URLs. Install it with pip:
pip3 install furl
Example: Handling Relative URLs
>>> from furl import furl
>>>
>>> base_url = "https://www.example.com/page/"
>>> relative_url = "../otherpage"
>>> absolute_url = furl(base_url).join(relative_url)
>>>
>>> print("Absolute URL:", absolute_url.url)
Absolute URL: https://www.example.com/otherpage
Example: Combining and Resolving URLs
>>> from furl import furl
>>>
>>> url1 = "https://www.example.com/page"
>>> url2 = "otherdir/otherpage"
>>> combined_url = furl(url1).join(url2)
>>>
>>> print("Combined URL:", combined_url.url)
Combined URL: https://www.example.com/otherdir/otherpage
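If adding a dependency is not an option, the standard library's urllib.parse.urljoin() resolves relative references in a similar way. The sketch below reuses the URLs from the two furl examples and adds two extra cases, which are illustrative additions, to show how the trailing slash on the base URL changes the result.

```python
from urllib.parse import urljoin

# Same inputs as the furl examples above.
print(urljoin("https://www.example.com/page/", "../otherpage"))
# https://www.example.com/otherpage

print(urljoin("https://www.example.com/page", "otherdir/otherpage"))
# https://www.example.com/otherdir/otherpage

# With a trailing slash, "page/" is treated as a directory and kept.
print(urljoin("https://www.example.com/page/", "otherdir/otherpage"))
# https://www.example.com/page/otherdir/otherpage

# A reference starting with "/" replaces the whole path.
print(urljoin("https://www.example.com/page/", "/top"))
# https://www.example.com/top
```

Without the trailing slash, the last path segment of the base is treated like a file name and is replaced by the relative reference.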