New data set: EP full-text data for text analytics


Post by EPO / PATSTAT Support » Wed Jun 19, 2019 7:23 am

Patent data analysts, researchers and PATSTAT users have long been asking for it: a bulk data set of XML-tagged titles, abstracts, descriptions, claims and search reports covering all EP publications.
It is designed to facilitate natural language processing work and can be linked to PATSTAT or other EP data sets containing bibliographic data.

On this web page you will find all the information you need: ... -analytics
A user guide is downloadable from the bottom of this page.

This data set is a compact extraction (230 GB unzipped for all 5.8 million EP publications) from the "EP full-text data" set. It includes the title, abstract, description, (amended) claims, search report and a URL to the PDF document, but no images or TIFFed document pages.
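To put the size in perspective, the two figures above imply an average of roughly 40 kB of text per publication. A quick check, assuming "GB" means decimal gigabytes (10^9 bytes), which the post does not specify:

```python
# Figures quoted in the announcement.
total_bytes = 230 * 10**9      # 230 GB unzipped; decimal GB is an assumption
publications = 5_800_000       # 5.8 million EP publications

# Average size per publication in kilobytes (1 kB = 1000 bytes).
avg_kb = total_bytes / publications / 1000
print(f"about {avg_kb:.1f} kB per publication on average")
```

This is only an average; individual documents will vary widely with the length of the description and claims.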

This data set has a CC-BY license and is available for download from Google Cloud Platform. Download fees charged by Google are payable by the user. If you nevertheless prefer to have it delivered by the EPO on an HDD, then order “EP full-text data” and we will include it on that drive.

To select only certain publications, e.g. those within a specific technical field, you need to combine this data set, via the publication number, kind code and publication date, with another data set containing bibliographic information on EP patents, e.g. PATSTAT, EP Bulletin Search, Global Patent Index, OPS, ...
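The selection described above could be sketched roughly as follows. This is a minimal illustration, not the EPO's method: the field names (`number`, `kind`, `date`, `ipc`, `text`), the date formats and the sample rows are all invented for the example, and the real layout of the full-text files and of the bibliographic source should be taken from the user guide.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]  # (publication number, kind code, publication date)


def make_key(row: Dict[str, str]) -> Key:
    """Build the join key; the field names are hypothetical."""
    number = row["number"].strip()
    kind = row["kind"].strip().upper()
    # Tolerate both 2019-01-02 and 20190102 (an assumption about the inputs).
    date = row["date"].strip().replace("-", "")
    return (number, kind, date)


def select_full_text(full_text: List[Dict[str, str]],
                     biblio: List[Dict[str, str]],
                     class_prefix: str) -> List[Dict[str, str]]:
    """Keep full-text rows whose bibliographic class starts with class_prefix."""
    wanted = {make_key(b) for b in biblio if b["ipc"].startswith(class_prefix)}
    return [row for row in full_text if make_key(row) in wanted]


# Invented sample rows, for illustration only.
full_text = [
    {"number": "9000001", "kind": "A1", "date": "20190102", "text": "first text"},
    {"number": "9000002", "kind": "B1", "date": "20190109", "text": "second text"},
]
biblio = [
    {"number": "9000001", "kind": "a1", "date": "2019-01-02", "ipc": "G06F"},
    {"number": "9000002", "kind": "B1", "date": "2019-01-09", "ipc": "A61K"},
]

selected = select_full_text(full_text, biblio, "G06")
print([row["number"] for row in selected])
```

Building the set of wanted keys from the smaller bibliographic selection first means the much larger full-text data only needs to be read once.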

We are convinced that this data set will be highly beneficial for machine learning and natural language processing and will open new opportunities for linguistic, econometric and patent based research.
We hope that the community will apply new and existing algorithms to extract more intelligence from this vast amount of patent data.
We look forward to your feedback!
PATSTAT Support Team
EPO - Vienna
patstat @