Concatenating PDF pages from OPS


Mike_k43
Posts: 52
Joined: Fri Jan 06, 2017 1:34 pm

Concatenating PDF pages from OPS

Post by Mike_k43 » Thu May 18, 2017 8:19 am

We are currently evaluating downloading PDF patent documents from OPS.

The OPS user manual states that OPS provides them on a page-by-page basis, and apparently the user must concatenate the individual pages at their end into one single running PDF. The EPO website points to third-party PDF packages for doing so. The PDF file specification is, however, not very strict. Concatenation may be done e.g. as an "incremental update" (as per the PDF ISO specification); on the other hand, many PDF printers seem to produce a "consolidated" concatenated PDF. The former "incrementally updated" concatenated PDF file will have several cross-reference sections and trailers, whereas the latter "consolidated" concatenated PDF file will contain only one cross-reference section and only one trailer. The ordering of indirect objects in the PDF file may be arbitrary. The file also carries indications of its author, the software that produced the concatenated file, a digital ID and a modification date. Given this allowed variability, any concatenated PDF file for a given patent publication obtained from individual PDF pages from OPS will predictably differ from the single PDF file for the same patent publication obtained from Espacenet. This might raise a suspicion that the former has been "forged" with respect to the latter.

Is there thus a "recommended" way of concatenating these individual PDF pages from OPS into a single PDF file?

Many thanks.
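
For readers who want to see which of the two layouts a given concatenated file has, here is a minimal sketch in Python. It is only a heuristic and not anything recommended by the EPO: it counts the `startxref` keyword, which normally appears once per cross-reference section. The function names are made up for illustration. Note that a linearized PDF also carries two `startxref` keywords, and a stream that happens to contain the keyword uncompressed would be miscounted.

```python
def count_xref_sections(pdf_bytes: bytes) -> int:
    """Rough proxy: each cross-reference section is normally announced by one 'startxref'."""
    return pdf_bytes.count(b"startxref")


def looks_incrementally_updated(pdf_bytes: bytes) -> bool:
    """Heuristic only: more than one 'startxref' suggests appended update sections."""
    return count_xref_sections(pdf_bytes) > 1


# Tiny synthetic byte strings, not real PDFs, just to show the idea.
consolidated = b"%PDF-1.4\n...objects...\nxref\n...\ntrailer\n<<>>\nstartxref\n123\n%%EOF\n"
incremental = consolidated + b"...more objects...\nxref\n...\ntrailer\n<<>>\nstartxref\n456\n%%EOF\n"

print(looks_incrementally_updated(consolidated))
print(looks_incrementally_updated(incremental))
```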


EPO / OPS Support
Posts: 1298
Joined: Thu Feb 22, 2007 5:32 pm

Re: Concatenating PDF pages from OPS

Post by EPO / OPS Support » Thu May 18, 2017 8:39 am

Hi,

We cannot give any details on how to treat PDF files beyond what you have already found in the OPS documentation and in our OPS FAQs. OPS offers only page-by-page access, and this is planned to stay that way.

The only other option we can offer is to load EP files via the web service of the European Publication Server: https://data.epo.org/publication-server/?lg=en. How that web service works is described here: http://www.epo.org/searching-for-patent ... /help.html. Unfortunately, we don't have any alternative with worldwide coverage.

As far as the difference between full-text files in OPS and Espacenet is concerned, please remember that loading original documents from Espacenet and putting them in your own database and search tool that is available to third parties is not allowed according to our Terms of use for the website and the Fair use charter.

Regards,
OPS support


Mike_k43
Posts: 52
Joined: Fri Jan 06, 2017 1:34 pm

Re: Concatenating PDF pages from OPS

Post by Mike_k43 » Mon May 22, 2017 10:49 am

Dear OPS Support,

Thanks for confirming that OPS will continue to provide page-by-page PDFs.

I now just need an answer to one very specific question: the PDFs from Espacenet are in "Linearized PDF" format. Are the single-page PDFs from OPS also in "Linearized PDF" format, or are they in "normal" PDF format?

Regards, mike_k43
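
This question can also be checked locally on a downloaded page. The sketch below is in Python and assumes only what the PDF specification says about linearized files, namely that the linearization parameter dictionary (with its `/Linearized` key) sits within the first 1024 bytes. The function name is hypothetical. A positive result only shows that the file was written as linearized; a later incremental update can leave the key in place while invalidating the linearization.

```python
def is_linearized(pdf_bytes: bytes) -> bool:
    """Return True if a /Linearized key appears within the first 1024 bytes of the file."""
    return b"/Linearized" in pdf_bytes[:1024]


# Synthetic headers, not real PDFs, to illustrate both outcomes.
linearized_head = b"%PDF-1.4\n1 0 obj\n<< /Linearized 1 /L 12345 /N 1 >>\nendobj\n"
plain_head = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"

print(is_linearized(linearized_head))
print(is_linearized(plain_head))
```

For a file on disk, reading the first 1024 bytes with `open(path, "rb").read(1024)` and passing them to the function is enough.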


EPO / OPS Support
Posts: 1298
Joined: Thu Feb 22, 2007 5:32 pm

Re: Concatenating PDF pages from OPS

Post by EPO / OPS Support » Mon May 22, 2017 10:52 am

Hi,

Espacenet and OPS load the same data, so it must be the same format.

Regards,
OPS support


Mike_k43
Posts: 52
Joined: Fri Jan 06, 2017 1:34 pm

Re: Concatenating PDF pages from OPS

Post by Mike_k43 » Mon Jun 12, 2017 1:09 pm

As an addendum, we now have the concatenating software running, using OAuth2 authorization. There is now one final problem, with the format of the publication numbers:

E.g., as per the manual, for EP1000000A1:
GET https://ops.epo.org/rest-services/publi ... /fullimage
X-OPS-Range: 1
Accept:application/pdf
Finds the 1st page of EP1000000A1.

But, for US2015045825A1:
GET https://ops.epo.org/rest-services/publi ... /fullimage
X-OPS-Range: 1
Accept:application/pdf
Returns 404 Not found.

What is the required number format for the image retrieval?

Regards, mike_k43
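
For reference, the request described above could be assembled roughly as follows in Python. This is a sketch and not the poster's actual software. The `Accept` and `X-OPS-Range` headers are taken from the post; sending the OAuth2 access token as an `Authorization: Bearer` header is an assumption, and `build_page_request` is a hypothetical helper name. The URL is passed in as a parameter because the full address is not shown in the post. Nothing is sent over the network here.

```python
import urllib.request


def build_page_request(url: str, page: int, access_token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for one PDF page.

    Assumption: the OAuth2 access token goes into an 'Authorization: Bearer' header.
    """
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {access_token}",  # assumed OAuth2 bearer scheme
            "Accept": "application/pdf",                 # from the post
            "X-OPS-Range": str(page),                    # page number, from the post
        },
        method="GET",
    )
```

The request would then be sent with `urllib.request.urlopen(request)`, and the response body would be the bytes of that single page.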


EPO / OPS Support
Posts: 1298
Joined: Thu Feb 22, 2007 5:32 pm

Re: Concatenating PDF pages from OPS

Post by EPO / OPS Support » Mon Jun 12, 2017 1:30 pm

Below is the result of querying your US document in the OPS Published Service with the Image constituent, followed by the URL for the full document.

1.) Query the OPS Published Service with the Image constituent to see all available images and the partial links that you need for downloading the documents:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.2/style/pub-inquiry.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
<ops:document-inquiry>
<ops:publication-reference>
<document-id document-id-type="docdb">
<country>US</country>
<doc-number>2015045825</doc-number>
<kind>A1</kind>
</document-id>
</ops:publication-reference>
<ops:inquiry-result>
<publication-reference>
<document-id document-id-type="docdb">
<country>US</country>
<doc-number>2015045825</doc-number>
<kind>A1</kind>
</document-id>
</publication-reference>
<ops:document-instance system="ops.epo.org" number-of-pages="41" desc="FullDocument" link="published-data/images/US/20150045825/A1/fullimage">
<ops:document-format-options>
<ops:document-format>application/pdf</ops:document-format>
<ops:document-format>application/tiff</ops:document-format>
</ops:document-format-options>
<ops:document-section name="ABSTRACT" start-page="1"/>
<ops:document-section name="BIBLIOGRAPHY" start-page="1"/>
<ops:document-section name="CLAIMS" start-page="41"/>
<ops:document-section name="DESCRIPTION" start-page="21"/>
<ops:document-section name="DRAWINGS" start-page="2"/>
</ops:document-instance>
<ops:document-instance system="ops.epo.org" number-of-pages="19" desc="Drawing" link="published-data/images/US/20150045825/A1/thumbnail">
<ops:document-format-options>
<ops:document-format>application/pdf</ops:document-format>
<ops:document-format>application/tiff</ops:document-format>
</ops:document-format-options>
<ops:document-section name="DRAWINGS" start-page="1"/>
</ops:document-instance>
</ops:inquiry-result>
</ops:document-inquiry>
</ops:world-patent-data>

2.) Take the partial URL for the full document, prefix it with the URL of the OPS service, http://ops.epo.org/3.2/rest-services/, and add a Range for every page (there are 41 pages and you have to query each separately):

So, as the XML shows, the complete URL is:
http://ops.epo.org/3.1/rest-services/pu ... ge?Range=1

I am using version 3.1 only for the purpose of demonstration; you should not use 3.1 but rather the production version, 3.2.

Also, you always have to retrieve images in these two steps, because each PDF or image collection has its own system for image retrieval, so there are no general rules for how to define a URL as far as the document number goes.
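
The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not official client code: `page_urls` is a hypothetical helper, the sample is a trimmed copy of the inquiry response, and the per-page URL is built as base URL + `link` attribute + `?Range=N`, as described in step 2. Note that the `link` in the response carries the number 20150045825 while the docdb doc-number is 2015045825, which is why the URL is read from the response rather than assembled from the publication number.

```python
import xml.etree.ElementTree as ET

BASE_URL = "http://ops.epo.org/3.2/rest-services/"  # service URL given in step 2
OPS_TAG = "{http://ops.epo.org}document-instance"   # namespace from the response


def page_urls(inquiry_xml: str, desc: str = "FullDocument") -> list:
    """Return one URL per page for the document instance with the given desc."""
    root = ET.fromstring(inquiry_xml)
    for instance in root.iter(OPS_TAG):
        if instance.get("desc") == desc:
            link = instance.get("link")
            pages = int(instance.get("number-of-pages"))
            return [f"{BASE_URL}{link}?Range={n}" for n in range(1, pages + 1)]
    raise ValueError(f"no document instance with desc={desc!r}")


# Trimmed copy of the inquiry response, keeping only what the helper reads.
SAMPLE = """<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
  <ops:document-inquiry>
    <ops:inquiry-result>
      <ops:document-instance system="ops.epo.org" number-of-pages="41" desc="FullDocument" link="published-data/images/US/20150045825/A1/fullimage"/>
    </ops:inquiry-result>
  </ops:document-inquiry>
</ops:world-patent-data>"""

urls = page_urls(SAMPLE)
print(len(urls))
print(urls[0])
```

Each URL in the list is then fetched separately, and the resulting single-page PDFs are concatenated on the client side.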

OPS support

