Hello Richard,
table TLS214_NPL_PUBLN can be linked to TLS212_CITATION via the attribute NPL_PUBLN_ID.
Here is a query that extracts all citation information (except the citation categories) for a single patent application.
Code: Select all
SELECT citing.appln_id, citing.publn_auth, citing.publn_nr, citing.publn_kind, citing.publn_date, tls212_citation.*,
       cited.publn_auth, cited.publn_nr, cited.publn_kind, cited.publn_date, tls214_npl_publn.*
FROM tls211_pat_publn citing
JOIN tls212_citation ON citing.pat_publn_id = tls212_citation.pat_publn_id
JOIN tls211_pat_publn cited ON tls212_citation.cited_pat_publn_id = cited.pat_publn_id
JOIN tls214_npl_publn ON tls212_citation.cited_npl_publn_id = tls214_npl_publn.npl_publn_id
JOIN tls201_appln citing_app ON citing.appln_id = citing_app.appln_id
WHERE citing_app.appln_nr_epodoc = 'WO2002US35880'
-- citing.publn_kind = 'a1' AND citing.publn_auth = 'ep' AND citing.appln_id = 16011356
ORDER BY citing.publn_date DESC, citing.appln_id
Your query will generate an enormous amount of data because you have hardly any restrictions and a large number of joined tables. Each table that you join in a query (mostly) multiplies the number of records. You are trying to re-create a data set, but that data set will be de-normalised (because of all the joins), and it becomes in principle unworkable because of its size. For example, joining the TLS209_APPLN_IPC table creates a new record (row in your result) for each existing row times the number of IPC classification codes of that application.
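To see this multiplication effect before running the big extraction, you can count how many IPC rows each application would contribute. This is a diagnostic sketch of my own, not part of your query:

```sql
-- Diagnostic sketch: each application joins to this many IPC rows,
-- so a join to TLS209_APPLN_IPC multiplies the result set by nb_ipc_rows
SELECT appln_id, COUNT(*) AS nb_ipc_rows
FROM tls209_appln_ipc
GROUP BY appln_id
ORDER BY nb_ipc_rows DESC;
```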
You can avoid this by using the STRING_AGG function, but the question is whether you really need all those IPC codes in your final data set. The same question goes for the names of the applicants and the inventors: do you really need the name of every single inventor and applicant? (Each of them multiplies your data set.)
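As a minimal sketch, STRING_AGG collapses the one-to-many IPC rows into a single column per application, so the join no longer multiplies records (syntax as in SQL Server 2017+ and PostgreSQL; the alias ipc_symbols is my own choice):

```sql
-- Sketch: one row per application, with all IPC codes in a single string
SELECT a.appln_id,
       STRING_AGG(i.ipc_class_symbol, ', ') AS ipc_symbols
FROM tls201_appln a
JOIN tls209_appln_ipc i ON a.appln_id = i.appln_id
GROUP BY a.appln_id;
```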
The table TLS226_PERSON_ORIG is probably not needed in your query; the address is already in TLS206_PERSON.
The table TLS215_CITN_CATEG has to be linked to the cited publications, not to the citing publication.
Here is a reworked/corrected query; to illustrate the above, I run it for one single application. It returns 36 records!
If you run this query for all applications where "appln_filing_year BETWEEN 1998 AND 2018" (54 million applications), you can see that the amount of data will spiral out of control.
Code: Select all
SELECT a.appln_filing_year, a.appln_id, a.appln_auth, a.appln_nr, a.earliest_publn_year, a.granted, a.nb_citing_docdb_fam,
       r.appln_title,
       t.pat_publn_id, t.publn_auth, t.publn_date, t.publn_first_grant, t.publn_nr,
       s.citn_origin, s.citn_replenished, s.npl_citn_seq_nr, s.pat_citn_seq_nr,
       s.cited_appln_id,
       s.cited_pat_publn_id,
       b.citn_categ, b.citn_id, b.citn_replenished, b.pat_publn_id,
       STRING_AGG(i.ipc_class_symbol, ', ') AS IPC_symbols,
       q.psn_name, q.psn_id, c.ctry_code
FROM tls201_appln a
JOIN tls209_appln_ipc i ON a.appln_id = i.appln_id
JOIN tls202_appln_title r ON a.appln_id = r.appln_id
JOIN tls211_pat_publn t ON a.appln_id = t.appln_id
JOIN tls212_citation s ON t.pat_publn_id = s.pat_publn_id
JOIN tls215_citn_categ b ON s.pat_publn_id = b.pat_publn_id AND s.citn_replenished = b.citn_replenished AND s.citn_id = b.citn_id
JOIN tls207_pers_appln o ON a.appln_id = o.appln_id
JOIN tls206_person q ON o.person_id = q.person_id
JOIN tls801_country c ON c.ctry_code = q.person_ctry_code
WHERE a.appln_id = 848
GROUP BY a.appln_filing_year, a.appln_id, a.appln_auth, a.appln_nr, a.earliest_publn_year, a.granted, a.nb_citing_docdb_fam,
         r.appln_title,
         t.pat_publn_id, t.publn_auth, t.publn_date, t.publn_first_grant, t.publn_nr,
         s.citn_origin, s.citn_replenished, s.npl_citn_seq_nr, s.pat_citn_seq_nr,
         s.cited_appln_id,
         s.cited_pat_publn_id,
         q.psn_name, q.psn_id, c.ctry_code,
         b.citn_categ, b.citn_id, b.citn_replenished, b.pat_publn_id
ORDER BY a.appln_filing_year, a.appln_id, t.pat_publn_id, s.pat_citn_seq_nr
Solution:
You have to go back to the drawing board and decide what data you really need for your research. Most probably, you should try to do any "counts" directly in the main database instead of extracting an enormous data set for further aggregation.
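For example, a citation count per filing year can be computed entirely in the database, so only the yearly totals come back instead of millions of detail rows. This is a rough sketch under the assumption that citation counts by year are what you are after:

```sql
-- Sketch: aggregate in the database, return only one row per filing year
SELECT a.appln_filing_year, COUNT(*) AS nb_citations
FROM tls201_appln a
JOIN tls211_pat_publn t ON a.appln_id = t.appln_id
JOIN tls212_citation s ON t.pat_publn_id = s.pat_publn_id
WHERE a.appln_filing_year BETWEEN 1998 AND 2018
GROUP BY a.appln_filing_year
ORDER BY a.appln_filing_year;
```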