Metadata-Version: 2.1
Name: publicsuffix2
Version: 2.20191221
Summary: Get a public suffix for a domain name using the Public Suffix List. Forked from and using the same API as the publicsuffix package.
Home-page: https://github.com/nexb/python-publicsuffix2
Author: nexB Inc., Tomaz Solc, David Wilson and others.
Author-email: info@nexb.com
License: MIT and MPL-2.0
Keywords: domain,public suffix,suffix,dns,tld,sld,psl,idna
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: License :: OSI Approved :: Mozilla Public License 2.0 (MPL 2.0)
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Internet :: Name Service (DNS)
Classifier: Topic :: Utilities
Classifier: Development Status :: 5 - Production/Stable
Description-Content-Type: text/x-rst

Public Suffix List module for Python
====================================

This module allows you to get the public suffix, as well as the registrable domain,
of a domain name using the Public Suffix List from http://publicsuffix.org

A public suffix is a domain suffix under which you can register domain
names, or under which the suffix owner does not control the subdomains.
Some examples of public suffixes of the former kind are ".com",
".co.uk" and "pvt.k12.wy.us"; examples of the latter kind are "github.io" and
"blogspot.com". The public suffix is sometimes referred to as the effective
or extended TLD (eTLD).
Accurately knowing the public suffix of a domain is useful when handling
web browser cookies, highlighting the most important part of a domain name
in a user interface or sorting URLs by web site. It is also used in a wide range
of research and applications that leverage Domain Name System (DNS) data.

This module builds the public suffix list as a trie structure, which makes lookups
more efficient than in string-based modules available for the same purpose. It can be used
effectively in large-scale distributed environments, such as PySpark.

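To illustrate why a trie suits suffix lookups, here is a minimal sketch that stores
rules label by label, from the TLD inward, and walks the labels of a domain once.
It is not this module's implementation: ``build_trie``, ``longest_suffix`` and the
tiny rule list are hypothetical names and data chosen for the example, and wildcard
and exception rules are left out.

```python
# Hypothetical illustration only; not the publicsuffix2 implementation.
_END = object()  # marker key: "a rule ends at this node"


def build_trie(rules):
    """Build a nested-dict trie keyed by labels, rightmost label first."""
    root = {}
    for rule in rules:
        node = root
        for label in reversed(rule.split(".")):
            node = node.setdefault(label, {})
        node[_END] = True
    return root


def longest_suffix(trie, domain):
    """Return the longest rule matching the end of `domain`, or None."""
    labels = domain.lower().strip(".").split(".")
    node = trie
    best = 0  # number of labels in the longest rule seen so far
    for depth, label in enumerate(reversed(labels), start=1):
        node = node.get(label)
        if node is None:
            break
        if _END in node:
            best = depth
    return ".".join(labels[-best:]) if best else None


trie = build_trie(["com", "uk", "co.uk", "pvt.k12.wy.us"])
print(longest_suffix(trie, "www.example.co.uk"))  # co.uk
print(longest_suffix(trie, "www.mine.local"))     # None
```

Each lookup touches at most one node per label of the queried domain, regardless of
how many rules the list contains.
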
This Python module comes with a copy of the Public Suffix List (PSL) so that it is
usable out of the box. Newer versions try to provide reasonably fresh copies of
this list. It also includes a convenience method to fetch the latest list. The PSL does
change regularly.

The code is a fork of the publicsuffix package and includes the same base API. In
addition, it contains a few variants useful for certain use cases, such as the option to
ignore wildcards or return only the extended TLD (eTLD). You just need to import publicsuffix2 instead.

The public suffix list is now provided in UTF-8 format. To correctly process
IDNA-encoded domains, either the query or the list must be converted. By default, the
module converts the PSL. If your use case includes UTF-8 domains, e.g., '食狮.com.cn',
you'll need to set the ``idna`` flag to False on instantiation (see examples below).
Failure to use the correct encoding for your use case can lead to incorrect results for
domains that use Unicode characters.

The code is MIT-licensed and the publicsuffix data list is MPL-2.0-licensed.


Usage
-----

Install with::

    pip install publicsuffix2

The module provides functions to obtain the base domain, or SLD, of an FQDN, as well as one
to get just the public suffix. The functions accept a number of boolean parameters that
control how wildcards are handled. In addition to the functions, the module exposes a class that
parses the PSL and allows for more control.

The module provides two equivalent functions to query a domain name and return the base domain,
or second-level domain: get_public_suffix() and get_sld()::

    >>> from publicsuffix2 import get_public_suffix, get_sld
    >>> get_public_suffix('www.example.com')
    'example.com'
    >>> get_sld('www.example.com')
    'example.com'
    >>> get_public_suffix('www.example.co.uk')
    'example.co.uk'
    >>> get_public_suffix('www.super.example.co.uk')
    'example.co.uk'
    >>> get_sld("co.uk")  # returns eTLD as is
    'co.uk'

|
These functions load and cache the public suffix list. To obtain the latest version of the
PSL, use the fetch() function to first download the latest version. Alternatively, you can pass
a custom list.

For more control, there is also a class that parses a Public
Suffix List and allows the same queries on individual domain names::

    >>> from publicsuffix2 import PublicSuffixList
    >>> psl = PublicSuffixList()
    >>> psl.get_public_suffix('www.example.com')
    'example.com'
    >>> psl.get_public_suffix('www.example.co.uk')
    'example.co.uk'
    >>> psl.get_public_suffix('www.super.example.co.uk')
    'example.co.uk'
    >>> psl.get_sld('www.super.example.co.uk')
    'example.co.uk'

Note that the ``host`` part of a URL can contain strings that are
not plain DNS domain names (IP addresses, Punycode-encoded names, names in
combination with a port number or a username, etc.). It is up to the
caller to ensure only domain names are passed to the get_public_suffix()
method.

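One way a caller might do this filtering is sketched below with the standard library
only. The helper name ``hostname_for_lookup`` is hypothetical and not part of
publicsuffix2. The sketch drops credentials and ports via ``urlsplit`` and rejects
IP addresses; callers with stricter requirements would likely add more validation.

```python
import ipaddress
from urllib.parse import urlsplit


def hostname_for_lookup(url):
    """Return the lowercased host of `url` if it looks like a DNS name, else None.

    Hypothetical helper for illustration; not part of publicsuffix2.
    """
    # .hostname strips any "user:password@" prefix and ":port" suffix,
    # removes the brackets around IPv6 literals, and lowercases the result.
    host = urlsplit(url).hostname
    if not host:
        return None  # no scheme/netloc, so no host could be extracted
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return host  # not an IP literal, treat it as a domain name
    return None  # IPv4 or IPv6 address: not a domain name


print(hostname_for_lookup("https://user:pw@www.example.co.uk:8443/path"))
# www.example.co.uk
print(hostname_for_lookup("http://192.0.2.10:8080/"))
# None
```

The returned string can then be passed to get_public_suffix() or get_sld().
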
The get_public_suffix() function and the PublicSuffixList class initializer accept
an optional argument pointing to a public suffix file. This can either be a file
path, an iterable of public suffix lines, or a file-like object pointing to an
opened list::

    >>> from publicsuffix2 import get_public_suffix
    >>> psl_file = 'path to some psl data file'
    >>> get_public_suffix('www.example.com', psl_file)
    'example.com'

Note that when using get_public_suffix() a global cache keeps the latest provided
suffix list data. The following call uses the cached list loaded above::

    >>> get_public_suffix('www.example.co.uk')
    'example.co.uk'

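For readers preparing their own iterable of public suffix lines, the sketch below
shows how lines in the PSL file format can be reduced to rules: blank lines and
``//`` comment lines are skipped, and each remaining line is read up to the first
whitespace. ``parse_psl_lines`` is a hypothetical helper written for this example;
it is not the parser used by publicsuffix2.

```python
def parse_psl_lines(lines):
    """Return the rules found in an iterable of PSL-format lines.

    Hypothetical helper for illustration; not the publicsuffix2 parser.
    """
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and full-line comments
        # A rule ends at the first whitespace; anything after it is ignored.
        rules.append(line.split()[0])
    return rules


sample = [
    "// ===BEGIN ICANN DOMAINS===",
    "",
    "com",
    "co.uk",
    "*.pg",
]
print(parse_psl_lines(sample))  # ['com', 'co.uk', '*.pg']
```
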
**IDNA-encoding.** The public suffix list is now in UTF-8 format. For those use cases that
include IDNA-encoded domains, the list must be converted. publicsuffix2 exposes IDNA
encoding as the ``idna`` parameter of the PublicSuffixList initializer, which is True by
default. For UTF-8 use cases, set the idna parameter to False::

    >>> from publicsuffix2 import PublicSuffixList
    >>> psl = PublicSuffixList(idna=True)  # on by default
    >>> psl.get_public_suffix('www.google.com')
    'google.com'
    >>> psl = PublicSuffixList(idna=False)  # use UTF-8 encodings
    >>> psl.get_public_suffix('食狮.com.cn')
    '食狮.com.cn'

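If queries arrive in a mix of both forms, one option is to normalize them before the
lookup. The sketch below uses Python's built-in ``idna`` codec, which implements the
older IDNA 2003 rules; ``to_idna`` and ``to_unicode`` are hypothetical helper names,
and whether this codec matches the conversion applied inside publicsuffix2 is an
assumption to verify for your data.

```python
def to_idna(domain):
    """Convert a Unicode domain name to its ASCII (Punycode) form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain):
    """Convert an ASCII (Punycode) domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


encoded = to_idna("食狮.com.cn")
print(encoded)              # an "xn--" label followed by ".com.cn"
print(to_unicode(encoded))  # 食狮.com.cn
```

Names that are already plain ASCII pass through ``to_idna`` unchanged, so the helper
can be applied to every query before using a list built with ``idna=True``.
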
**Ignore wildcards.** In some use cases, particularly those related to large-scale domain processing,
the user might want to ignore wildcards to create more aggregation. This is possible by setting
the parameter wildcard=False::

    >>> psl.get_public_suffix('telinet.com.pg', wildcard=False)
    'com.pg'
    >>> psl.get_public_suffix('telinet.com.pg', wildcard=True)
    'telinet.com.pg'

**Require valid eTLDs (strict).** In the publicsuffix2 module, a domain with an invalid TLD
still returns a base domain, e.g.::

    >>> psl.get_public_suffix('www.mine.local')
    'mine.local'

This is useful for many use cases, while in others, we want to ensure that the domain includes a
valid eTLD. In this case, the boolean parameter strict provides a solution. If this flag is set,
a domain with an invalid TLD returns None::

    >>> psl.get_public_suffix('www.mine.local', strict=True) is None
    True

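The effect of the two flags can be modelled in a few lines. The sketch below is a
simplified stand-in, not the module's code: ``registrable_domain`` and the small
rule set are hypothetical, exception rules (``!``) are omitted, and ignoring a
wildcard is modelled by reading ``*.pg`` as plain ``pg``, an assumption chosen to
reproduce the outputs shown above.

```python
def registrable_domain(domain, rules, wildcard=True, strict=False):
    """Return the registrable (base) domain of `domain`, or None.

    Simplified model for illustration; not the publicsuffix2 implementation.
    """
    labels = domain.lower().strip(".").split(".")
    suffix_len = 0  # label count of the longest matching public suffix
    for i in range(len(labels)):
        name = ".".join(labels[i:])
        n = len(labels) - i
        if name in rules:
            suffix_len = max(suffix_len, n)
        if "*." + name in rules:
            if not wildcard:
                # Wildcards ignored: treat "*.pg" as the plain rule "pg".
                suffix_len = max(suffix_len, n)
            elif i > 0:
                # Wildcard matches one extra label to the left of `name`.
                suffix_len = max(suffix_len, n + 1)
    if suffix_len == 0:
        if strict:
            return None  # no rule matched: the TLD is not in the list
        suffix_len = 1   # fallback: treat the last label as the suffix
    if len(labels) <= suffix_len:
        return ".".join(labels)  # the name is itself a public suffix
    return ".".join(labels[-(suffix_len + 1):])


rules = {"com", "uk", "co.uk", "*.pg"}
print(registrable_domain("telinet.com.pg", rules, wildcard=False))  # com.pg
print(registrable_domain("telinet.com.pg", rules, wildcard=True))   # telinet.com.pg
print(registrable_domain("www.mine.local", rules))                  # mine.local
print(registrable_domain("www.mine.local", rules, strict=True))     # None
```
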
**Return eTLD only.** The standard use case for publicsuffix2 is to return the registrable,
or base, domain according to the public suffix list. In some cases, however, we only wish
to find the eTLD itself. This is available via the get_tld() method::

    >>> psl.get_tld('www.google.com')
    'com'
    >>> psl.get_tld('www.google.co.uk')
    'co.uk'

All of the methods and functions include the wildcard and strict parameters.

For convenience, the public method get_sld() is available. This is identical to the method
get_public_suffix() and is intended to clarify the output for some users.

|
To **update the bundled suffix list** use the provided setup.py command::

    python setup.py update_psl

The updated list will be saved in `src/publicsuffix2/public_suffix_list.dat`
and you can build a new wheel with this bundled data.

Alternatively, there is a fetch() function that will fetch the latest version
of a Public Suffix data file from https://publicsuffix.org/list/public_suffix_list.dat
You can use it this way::

    >>> from publicsuffix2 import get_public_suffix
    >>> from publicsuffix2 import fetch
    >>> psl_file = fetch()
    >>> get_public_suffix('www.example.com', psl_file)
    'example.com'

Note that once loaded, the data file is cached and therefore fetched only
once.

The extracted public suffix list, that is the TLDs and their modifiers, is stored in
the instance attribute ``tlds``::

    >>> psl = PublicSuffixList()
    >>> psl.tlds[:5]
    ['ac',
    'com.ac',
    'edu.ac',
    'gov.ac',
    'net.ac']

**Using the module in large-scale processing**
If using this library in large-scale PySpark processing, you should instantiate the class as
a global variable, not within a user function. The class methods can then be used within user
functions for distributed processing.

Source
------

Get a local copy of the development repository. The development takes
place in the ``develop`` branch. Stable releases are tagged in the ``master``
branch::

    git clone https://github.com/nexB/python-publicsuffix2.git


History
-------

This code is forked from Tomaž Šolc's fork of David Wilson's code.

Tomaž Šolc's code originally at:

https://www.tablix.org/~avian/git/publicsuffix.git

Copyright (c) 2014 Tomaž Šolc <tomaz.solc@tablix.org>

David Wilson's code was originally at:

http://code.google.com/p/python-public-suffix-list/

Copyright (c) 2009 David Wilson


License
-------

The code is MIT-licensed.
The vendored public suffix list data from Mozilla is under the MPL-2.0.

Copyright (c) 2015 nexB Inc. and others.

Copyright (c) 2014 Tomaž Šolc <tomaz.solc@tablix.org>

Copyright (c) 2009 David Wilson

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.

Changelog
---------

2019-12-19 publicsuffix2 2.20191219

* Add new strict mode to get_tld() by @hiratara.
* Update TLD list
* Add tests from Mozilla test suite


2019-08-12 publicsuffix2 2.20190812

* Fix regression in available tlds.
* Format and streamline code.


2019-08-11 publicsuffix2 2.20190811

* Update publicsuffix.file to the latest version from Mozilla.


2019-08-08 publicsuffix2 2.20190808

* Add additional functionality and handle changes to the PSL format
* Add attribute to retrieve the PSL as a list


2019-02-05 publicsuffix2 2.201902051213

* Update publicsuffix.file to the latest version from Mozilla.
* Restore a fetch() function by popular demand


2018-12-13 publicsuffix2 2.20181213

* Update publicsuffix.file to the latest version from Mozilla.


2018-10-01 publicsuffix2 2.20180921.2

* Update publicsuffix.file to the latest version from Mozilla.
* Breaking API change: publicsuffix module renamed to publicsuffix2


2016-08-18 publicsuffix2 2.20160818

* Update publicsuffix.file to the latest version from Mozilla.


2016-06-21 publicsuffix2 2.20160621

* Update publicsuffix.file to the latest version from Mozilla.
* Adopt new version scheme: major.<publicsuffix list date>


2015-10-12 publicsuffix2 2.1.0

* Merged latest updates from publicsuffix
* Added new convenience top level get_public_suffix() function caching
  a loaded list if needed.
* Updated publicsuffix.file to the latest version from Mozilla.
* Added an update_psl setup command to fetch and vendor the latest list
  Use as: python setup.py update_psl


2015-06-04 publicsuffix2 2.0.0

* Forked publicsuffix, but kept the same API
* Updated publicsuffix.file to the latest version from Mozilla.
* Changed packaging to have the suffix list be package data
  and be wheel friendly.
* Use spaces indentation, not tabs


2014-01-14 publicsuffix 1.0.5

* Correctly handle fully qualified domain names (thanks to Matthäus
  Wander).
* Updated publicsuffix.txt to the latest version from Mozilla.

2013-01-02 publicsuffix 1.0.4

* Added missing change log.

2013-01-02 publicsuffix 1.0.3

* Updated publicsuffix.txt to the latest version from Mozilla.
* Added trove classifiers.
* Minor update of the README.

2011-10-10 publicsuffix 1.0.2

* Compatibility with Python 3.x (thanks to Joern
  Koerner) and Python 2.5

2011-09-22 publicsuffix 1.0.1

* Fixed installation issue under virtualenv (thanks to
  Mark McClain)

2011-07-29 publicsuffix 1.0.0

* First release