UniProt to patent data

This is the place where the linked data/open data community can ask and respond to questions about or share experiences with EPO’s open bulk data sets. The moderator will use this forum to announce product related news.

jerven
Posts: 3
Joined: Thu Jul 18, 2019 9:27 am

UniProt to patent data

Post by jerven » Thu Jul 18, 2019 9:43 am

Dear EPO,

It is now possible to run searches from UniProt against the EPO linked data SPARQL endpoint.
UniProt is a large life-sciences database with a professionally supported SPARQL endpoint and
RDF production that has been running for 15 years.

The combination of UniProt and EPO allows us to answer questions such as: which proteins were described in
patent publications for which the patent was granted more than 20 years ago? Try this example at https://sparql.uniprot.org

Code: Select all

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
PREFIX skos:<http://www.w3.org/2004/02/skos/core#> 
PREFIX up:<http://purl.uniprot.org/core/> 
prefix patent: <http://data.epo.org/linked-data/def/patent/>
SELECT ?grantDate ?patent ?application ?applicationNo
WHERE
{
  ?citation a up:Patent_Citation ;
  skos:exactMatch ?patent .
  BIND(SUBSTR(STR(?patent), 35) AS ?applicationNo)
  BIND(SUBSTR(STR(?patent), 33, 2) AS ?countryCode)
  SERVICE<https://data.epo.org/linked-data/query>{
    ?publication patent:publicationNumber ?applicationNo ;
                 patent:application ?application . 
    ?application patent:grantDate ?grantDate .
  }
  BIND((year(now()) - 20) AS ?thisYearMinusTwenty)
  BIND(year(?grantDate) AS ?grantYear)
  FILTER(?grantYear < ?thisYearMinusTwenty)
} ORDER BY ?grantYear
You will notice that there are quite a few modelling differences, some of which can be improved.

For example

Code: Select all

citation:SIP70B9B735F3162330 rdf:type up:Patent_Citation ;
  up:title "DNA sequences coding for the DR beta-chain locus of the human lymphocyte antigen complex and polypeptides, diagnostic typing processes and products related thereto." ;
  up:author "Mach B.F." ,
    "Long E.O." ,
    "Wake C.T." ;
  up:date "1984-03-28"^^xsd:date ;
  skos:exactMatch <http://purl.uniprot.org/patents/EP0103960> .
This actually matches https://data.epo.org/linked-data/data/p ... 03960/A2/-
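
To check the string arithmetic outside SPARQL, here is a minimal Python sketch of the split that the two SUBSTR calls in the query perform. The function name is made up for illustration, and the sketch assumes that every UniProt patent IRI uses the prefix shown in the citation above followed by a two-letter office code.

```python
# Prefix of UniProt patent IRIs as seen in the example citation above.
UNIPROT_PATENT_PREFIX = "http://purl.uniprot.org/patents/"


def split_uniprot_patent_iri(iri: str) -> tuple[str, str]:
    """Split a UniProt patent IRI into (office code, publication number).

    Mirrors SUBSTR(STR(?patent), 33, 2) and SUBSTR(STR(?patent), 35):
    SPARQL's SUBSTR is 1-based and the prefix is 32 characters long.
    """
    if not iri.startswith(UNIPROT_PATENT_PREFIX):
        raise ValueError(f"not a UniProt patent IRI: {iri!r}")
    local = iri[len(UNIPROT_PATENT_PREFIX):]
    # Assumption: the office code is always exactly two characters.
    return local[:2], local[2:]


office, number = split_uniprot_patent_iri("http://purl.uniprot.org/patents/EP0103960")
print(office, number)  # EP 0103960
```

The second value is what the query binds to ?applicationNo, and the first is what it binds to ?countryCode.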

I was wondering how we can better align the UniProt citation model with the EPO one,
and whether the query I posted above is misleading in some way, given my limited knowledge of the EPO data model.

Regards,
Jerven


EPO / PATSTAT Support
Posts: 425
Joined: Thu Feb 22, 2007 5:33 pm

Re: UniProt to patent data

Post by EPO / PATSTAT Support » Thu Aug 29, 2019 8:12 am

Hi Jerven,

I applaud your effort to link the extensive UniProt data set to the EPO patents. Let me give some comments and suggestions for improvement, because patent information is not always self-explanatory.

For easier reference, I attached a PPT slide with a real example.
UniProt-Example_v1.0.pptx
In general:
  • A (patent) application is identified by the application number plus(!) the (country) code of the patent office. EP is the code for the European Patent Office, US for the United States Patent and Trademark Office, and so on. The same number can occur at several offices, so the office code is essential.
    Your SPARQL query does not match on the office code.
  • Each application has one or more (patent) publications. They are identified by the publication number, publication office, publication kind code (A1, A2, ..., B1, ...) and (in rare cases) also by the publication date.
    The office of an application and the office of its publications are always identical. However, the application number and the numbers of its publications are different.
    The patent citations in UniProt refer to publication numbers, so you must match them against publication numbers, not application numbers.
  • There is the great concept of a “(simple) patent family”. A family consists of one or more patent applications which, of course, have different numbers and are from different offices, but which contain the same idea, that is, the same invention. For example, these applications may be translations, because different offices may have different language requirements.
  • EPO LOD has a lot of information about EP applications and publications. It also includes all the (non-EP) family members of every family that contains an EP application. For these non-EP family members, only the identifying information for applications and publications is available.
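
As a rough illustration of the identifiers described above, here is a small Python sketch. The class and field names are hypothetical and are not part of the EPO data model. The second publication is a placeholder that only shows why the office code matters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApplicationId:
    office: str   # code of the patent office, e.g. "EP" or "US"
    number: str   # application number, unique only within one office


@dataclass(frozen=True)
class PublicationId:
    office: str   # always the same office as the application
    number: str   # publication number, differs from the application number
    kind: str     # publication kind code, e.g. "A1", "A2", "B1"


# The publication from the example citation earlier in the thread.
ep_pub = PublicationId(office="EP", number="0103960", kind="A2")
# Placeholder only: the same digits under another office would identify
# a different publication, so the number alone is not a key.
other_pub = PublicationId(office="US", number="0103960", kind="A2")

print(ep_pub == other_pub)  # False
```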
Going back to the example on the attached slide, the correct way to link a patent citation from UniProt (lower left corner) to EPO LOD might be:
  1. Match the UniProt citation with a publication via publication number AND office. In the example you will get a match with a US publication.
  2. You may then navigate further to the US application.
  3. If you want more data, you have to find the EP application and publication via the family.
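
Steps 1 and 2 can be sketched for a single citation as a small Python helper that builds the corresponding SPARQL text. The function name is hypothetical, and the sketch assumes the number and office code are stored as plain string literals. Step 3 is left out because the property that links an application to its family is not named in this thread.

```python
def build_publication_lookup(office: str, publication_no: str) -> str:
    """Return SPARQL for steps 1 and 2: find the publication, then its application."""
    # Reject anything that could break out of the quoted literals.
    if not (office.isalpha() and publication_no.isalnum()):
        raise ValueError("unexpected characters in office code or number")
    return f"""PREFIX patent: <http://data.epo.org/linked-data/def/patent/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?publication ?application
WHERE {{
  # Step 1: match on publication number AND office.
  ?publication patent:publicationNumber "{publication_no}" ;
               patent:publicationAuthority/skos:notation "{office}" .
  # Step 2: navigate to the application.
  ?publication patent:application ?application .
}}"""


print(build_publication_lookup("EP", "0103960"))
```

The resulting text is meant to be sent to https://data.epo.org/linked-data/query directly, without the UniProt side of the federation.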
Here is a SPARQL query I wrote based on your example.
To simplify the query and to get more hits, I removed the filter on the grant date.
Note that patent:grantDate is an optional property. It is only available for EP applications, and if the EP patent has not been granted (yet), the property is missing.

Code: Select all

PREFIX patent:<http://data.epo.org/linked-data/def/patent/>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
PREFIX skos:<http://www.w3.org/2004/02/skos/core#> 
PREFIX up:<http://purl.uniprot.org/core/> 

SELECT ?UPpatentPublication ?EPOpublication ?publicationNo ?EPOapplication ?grantDate
WHERE
{
  ?citation a up:Patent_Citation ;
            skos:exactMatch ?UPpatentPublication .
  BIND(SUBSTR(STR(?UPpatentPublication), 35) AS ?publicationNo)
  BIND(SUBSTR(STR(?UPpatentPublication), 33, 2) AS ?countryCode)

  SERVICE <https://data.epo.org/linked-data/query> {
    ?EPOpublication patent:publicationNumber ?publicationNo ;
                    patent:publicationAuthority/skos:notation ?countryCode ;
                    patent:application ?EPOapplication .
    OPTIONAL {
      ?EPOapplication patent:grantDate ?grantDate .
    }
  }
} LIMIT 1000
I hope this helps.
Martin / EPO


jerven
Posts: 3
Joined: Thu Jul 18, 2019 9:27 am

Re: UniProt to patent data

Post by jerven » Wed Jun 17, 2020 10:35 pm

Dear Martin,

I finally had some time to look into correcting our data using the SPARQL endpoint (and there are quite a few corrections to be made on our side).

However, I came across the publication WO/2009114939/A1, which I can't seem to find in your SPARQL endpoint.

The classic search does find it: https://register.epo.org/application?nu ... 7&tab=main
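
One way to reproduce the check is sketched below in Python. The helper names are made up for illustration. The ASK query uses only the two properties already shown in this thread, so the kind code is parsed but not tested, and the literals are assumed to be plain strings.

```python
def parse_publication_ref(ref: str) -> tuple[str, str, str]:
    """Split 'OFFICE/NUMBER/KIND' into its three parts."""
    parts = ref.strip().split("/")
    if len(parts) != 3 or not all(p.isalnum() for p in parts):
        raise ValueError(f"expected OFFICE/NUMBER/KIND, got {ref!r}")
    office, number, kind = parts
    return office, number, kind


def build_ask_query(ref: str) -> str:
    """Build an ASK query that tests for a publication by number and office."""
    office, number, _kind = parse_publication_ref(ref)
    # The kind code is not used: this thread does not show which
    # property holds it in the EPO data.
    return (
        "PREFIX patent: <http://data.epo.org/linked-data/def/patent/>\n"
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "ASK {\n"
        f'  ?publication patent:publicationNumber "{number}" ;\n'
        f'               patent:publicationAuthority/skos:notation "{office}" .\n'
        "}"
    )


print(build_ask_query("WO/2009114939/A1"))
```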

Regards,
Jerven


jerven
Posts: 3
Joined: Thu Jul 18, 2019 9:27 am

Re: UniProt to patent data

Post by jerven » Fri Jun 19, 2020 8:05 pm

It is very likely that, from the UniParc 2020_04 release, we will link some 2.3 million sequences to about 63,000 patent publications via 4.5 million intermediate objects.

This is in addition to the UniProtKB-to-publication improvements.


adamusa
Posts: 1
Joined: Fri Jan 26, 2024 3:26 am

Re: UniProt to patent data

Post by adamusa » Fri Jan 26, 2024 3:29 am

UniProt provides protein data from a variety of sources, including contributions from research groups, data from scientific articles, and other data sources.

