A bulk data set consisting of XML-tagged titles, abstracts, descriptions, claims and search reports covering all EP publications, designed to facilitate natural language processing work and linkable to PATSTAT.
On this web page you will find all the information you need: https://www.epo.org/searching-for-paten ... ytics.html
And here is the user manual:
In a nutshell:
The EP full-text data is available as two different products:
- EP full-text data
- EP full-text data for text analytics
That brings us to the new product “EP full-text data for text analytics”. This data set is a much more compact extraction from the full set (230 GB unzipped for all 5.8 million EP publications): it “only” includes the title, abstract, description, (amended) claims, search report and a URL to the PDF document. A link to the PATSTAT database can be established via the publication table (TLS211_PAT_PUBLN or REG102_PAT_PUBLN), with the publication authority “EP”, the publication number, the publication kind code and the publication date as the linking fields. This data set is free and available for download from Google Cloud Platform. Download fees charged by Google are payable by the user. If you nevertheless prefer to have it delivered by the EPO on an HDD, order “EP full-text data” and we will include it on that drive.
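To make the linking idea a bit more concrete, here is a minimal sketch in Python of matching a full-text record to TLS211_PAT_PUBLN on the four linking fields. Please read it as an illustration under assumptions, not as the official procedure: the record keys (`authority`, `number`, `kind`, `date`), the helper `link_to_patstat` and the sample rows are all hypothetical, and an in-memory SQLite table stands in for a real PATSTAT installation. The way numbers and dates are formatted in the full-text files may also differ from PATSTAT and need normalising first.

```python
import sqlite3

# Toy stand-in for PATSTAT: only the TLS211_PAT_PUBLN columns needed for linking.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tls211_pat_publn (
        pat_publn_id INTEGER PRIMARY KEY,
        publn_auth   TEXT,
        publn_nr     TEXT,
        publn_kind   TEXT,
        publn_date   TEXT,
        appln_id     INTEGER
    )
    """
)
# Placeholder rows, invented for illustration only (not real PATSTAT content).
conn.executemany(
    "INSERT INTO tls211_pat_publn VALUES (?, ?, ?, ?, ?, ?)",
    [
        (101, "EP", "9999001", "A1", "2020-01-01", 5001),
        (102, "EP", "9999001", "B1", "2021-06-02", 5001),
    ],
)


def link_to_patstat(connection, fulltext_records):
    """Match each full-text record to a publication row on all four linking fields.

    Returns a list of (record, pat_publn_id, appln_id) for the records that matched.
    """
    query = """
        SELECT pat_publn_id, appln_id
        FROM tls211_pat_publn
        WHERE publn_auth = ? AND publn_nr = ? AND publn_kind = ? AND publn_date = ?
    """
    linked = []
    for rec in fulltext_records:
        key = (rec["authority"], rec["number"], rec["kind"], rec["date"])
        row = connection.execute(query, key).fetchone()
        if row is not None:
            linked.append((rec, row[0], row[1]))
    return linked


# Hypothetical full-text records; the key names are assumptions, not the real schema.
records = [
    {"authority": "EP", "number": "9999001", "kind": "B1", "date": "2021-06-02"},
    {"authority": "EP", "number": "9999002", "kind": "A1", "date": "2020-01-01"},
]

linked = link_to_patstat(conn, records)
for rec, pat_publn_id, appln_id in linked:
    print(rec["number"], rec["kind"], "->", pat_publn_id, appln_id)
```

Matching on the kind code and date as well as the number matters because one EP publication number can appear several times (for example as A1 and later B1), and each of those is a separate row in the publication table.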
The PATSTAT team is convinced that this data set will be highly beneficial for machine learning and natural language processing, and will open new opportunities for econometric and patent-based research.
So we hope that the community will apply existing algorithms, and come up with new ones, to extract more intelligence out of this vast amount of patent data.
Looking forward to your feedback!