Now available: EP full-text data for analytics

EPO / PATSTAT Support · Post by **EPO / PATSTAT Support** » Wed Jun 05, 2019 9:40 am

Patent data analysts, researchers and PATSTAT users have long been asking for it.
It is a bulk data set consisting of XML-tagged titles, abstracts, descriptions, claims and search reports covering all EP publications, designed to facilitate natural language processing work and linkable to PATSTAT.
On this web page you will find all the information you need: https://www.epo.org/searching-for-paten ... ytics.html
And here is the user manual:

EP_full_text_data_for_text_analytics-user_guide_v1.0_en.pdf (622.44 KiB)

In a nutshell:
The EP full-text data is available in two different products:

EP full-text data
EP full-text data for text analytics

The first is the full set, including PDF, TIFF and XML-tagged data, delivered on a HDD as the “EP full-text data” bulk data product. The data set is 2.23 TB (zipped) and can be ordered here: https://www.epo.org/searching-for-paten ... html#tab-1 For most PATSTAT users, we think this extended data set is more than is typically needed, because PATSTAT already contains all the bibliographic data, nicely structured in a relational database.

That brings us to the new product, “EP full-text data for analytics”. This data set is a much more compact extraction from the full set (230 GB unzipped for all 5.8 million EP publications), which “only” includes the title, abstract, description, (amended) claims, search report and a URL to the PDF document. It can be linked to the PATSTAT database via the publication table (TLS211_PAT_PUBLN or REG102_PAT_PUBLN), using the publication authority “EP”, the publication number, the publication kind code and the publication date as the linking fields. The data set is free and available for download from Google Cloud Platform; download fees charged by Google are payable by the user. If you nevertheless prefer to have it delivered by the EPO on a HDD, order “EP full-text data” and we will include it on that drive.
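As a rough illustration of the linking step, here is a minimal Python sketch that matches full-text records to PATSTAT publications on the four linking fields. The PATSTAT column names (publn_auth, publn_nr, publn_kind, publn_date, pat_publn_id) are assumed to follow the TLS211_PAT_PUBLN layout; the full-text field names (authority, number, kind, date) and all sample values are hypothetical, so please check the user manual for the actual field names and formats.

```python
from datetime import date
from typing import NamedTuple


class PublicationKey(NamedTuple):
    """The four linking fields named in the announcement."""
    authority: str
    number: str
    kind: str
    publn_date: date


def make_key(authority, number, kind, publn_date):
    # Light normalisation only; real data may need more (an assumption).
    return PublicationKey(authority.strip().upper(), number.strip(),
                          kind.strip().upper(), publn_date)


def build_index(patstat_rows):
    """Map each linking key to its pat_publn_id (assumed TLS211 columns)."""
    return {
        make_key(r["publn_auth"], r["publn_nr"], r["publn_kind"], r["publn_date"]):
            r["pat_publn_id"]
        for r in patstat_rows
    }


def link_full_text(full_text_records, index):
    """Pair each full-text record with its pat_publn_id, or None if unmatched."""
    return [
        (rec, index.get(make_key(rec["authority"], rec["number"],
                                 rec["kind"], rec["date"])))
        for rec in full_text_records
    ]


# Made-up sample rows, for illustration only.
patstat_rows = [
    {"pat_publn_id": 1001, "publn_auth": "EP", "publn_nr": "1234567",
     "publn_kind": "A1", "publn_date": date(2002, 8, 14)},
]
full_text_records = [
    {"authority": "EP", "number": "1234567", "kind": "A1",
     "date": date(2002, 8, 14), "title": "Example title"},
    {"authority": "EP", "number": "1234567", "kind": "B1",
     "date": date(2005, 3, 2), "title": "Example title"},
]

index = build_index(patstat_rows)
links = link_full_text(full_text_records, index)
```

In this sketch the A1 record resolves to a pat_publn_id, while the B1 record stays unmatched because the sample PATSTAT rows contain no B1 publication. Using all four fields as the key keeps different kind codes of the same publication number apart.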

The PATSTAT team is convinced that this data set will be highly beneficial for machine learning and natural language processing, and will open new opportunities for econometric and patent-based research.
So we hope that the community will apply new and existing algorithms to extract more intelligence from this vast amount of patent data.
Looking forward to your feedback!

Fr3dY · Post by **Fr3dY** » Wed Jun 26, 2019 12:34 pm

Hi,

Where's the free dataset, available for download from Google Cloud Platform? Couldn't find the link, do I have to specifically request it? And what about these Google fees?

Regards,

mkracker · Post by **mkracker** » Wed Jun 26, 2019 4:26 pm

Hi,

The data set is open (CC-BY) and free. You do not need to request it.

Detailed information is on the product page https://www.epo.org/searching-for-paten ... -analytics.

The user manual (accessible from the bottom of the "Getting started" tab on the product page) describes on page 6 how to download it from Google Cloud Platform. Google will charge you for the download. I estimate that downloading all files (about 210 GB) will cost you about 25 USD. Different fees may apply for Australia and China.

Alternatively, you may request it on an external hard disk drive from the EPO, but for a considerably higher fee.

Martin / EPO
