DOCDB Family fractional count by country of inventors

Here you can post your opinions, ask questions and share experiences on the PATSTAT product line. Please always indicate the PATSTAT edition (e.g. 2015 Autumn Edition) and the database (e.g. PATSTAT Online, MySQL, MS SQL Server, ...) you are using.
Post Reply

Valentin
Posts: 3
Joined: Fri Jun 18, 2021 11:52 am

DOCDB Family fractional count by country of inventors

Post by Valentin » Fri Jun 18, 2021 12:25 pm

Hello everyone,

I would like some help and advices with a query aiming to execute a fractional count of DOCDB families from CPC Y02B by country of inventors (here France) on Patstat Online.

Here what I'm doing (kind of resourcefulness but it works fine) :

Code: Select all

SELECT DISTINCT T2.earliest_filing_year, SUM(T2.CASTFRACOUNT)
FROM
(SELECT T.earliest_filing_year, T.docdb_family_id, COUNT(T.REPLACED) AS COUNTALL, SUM(T.SIFR) AS COUNTFR, (CAST(SUM(T.SIFR) AS DECIMAL(3,1)) / CAST(COUNT(T.REPLACED) AS DECIMAL(3,1))) AS CASTFRACOUNT
    FROM
        (SELECT DISTINCT table1.earliest_filing_year, REPLACE(table4.person_ctry_code, '  ','NULL') AS REPLACED, table2.docdb_family_id,
        CASE 
        WHEN REPLACE(table4.person_ctry_code, '  ','NULL')='FR' THEN 1
        ELSE 0
        END SIFR
        FROM tls201_appln table1
        INNER JOIN tls225_docdb_fam_cpc table2 ON table1.docdb_family_id=table2.docdb_family_id
        INNER JOIN tls207_pers_appln table3 ON table1.appln_id=table3.appln_id
        INNER JOIN tls206_person table4 ON table4.person_id=table3.person_id
        AND table2.cpc_class_symbol LIKE 'Y02B%'
        AND table1.earliest_filing_year >=1970
        AND table1.earliest_filing_year<=2018
        AND ipr_type='PI'
        AND granted='Y'
        AND REPLACE(table4.person_ctry_code, '  ','NULL') LIKE '__'
        GROUP BY earliest_filing_year,REPLACE(table4.person_ctry_code, '  ','NULL'), table2.docdb_family_id) T 
GROUP BY T.earliest_filing_year, T.docdb_family_id) T2
GROUP BY T2.earliest_filing_year
ORDER BY T2.earliest_filing_year
1) I know something is approximative in the querry as I just delete the lines docdb_family_id with no results in person_ctry_code. Thereby, I miss some families that could be associates with a french inventor and I miss also an exact fractional count of families with a french among inventors and with an indeterminated inventors. Are these no results in person_ctry_code due to a lack of informations about the only, second, third or... inventors ? If so, how to remediate to this lack of information in my case ? I know the paper of (de Rassenfosse, 2012).

2) Is there any other mistakes that I could miss ? SQL always give you a result and we can easily don't see a mistake.

3) I know the earliest_filing_date is quite unconventional but I would like the date as near as possible from the date of invention.


Thanks a lot for your help,
Valentin


EPO / PATSTAT Support
Posts: 425
Joined: Thu Feb 22, 2007 5:33 pm
Contact:

Re: DOCDB Family fractional count by country of inventors

Post by EPO / PATSTAT Support » Mon Jun 21, 2021 1:03 pm

Hello Valentin,
looking at your query; it is a bit difficult to understand exactly what you would like to do with regards to the fractional counting. Can you explain in a couple of sentences, what you would like to do and how you expect the results to look like ?

If your methodology is to be used on patents having a French inventor in the 'Y02B%' area, it might be easier to simply extract that data set and work of-line.
That gives you more freedom to manipulate the data using whatever tools you prefer.
Using a query like below will give you about 10.000 patent applications with French inventors which is within the download limits for an extraction download.

Code: Select all

SELECT distinct  tls201_appln.appln_id
FROM   tls201_appln  join tls207_pers_appln on tls201_appln.appln_id = tls207_pers_appln.appln_id
join tls206_person on tls207_pers_appln.person_id = tls206_person.person_id
join tls224_appln_cpc on tls201_appln.appln_id = tls224_appln_cpc.appln_id
WHERE
invt_seq_nr > 0
and person_ctry_code = 'FR'
AND appln_filing_year > 1970
and cpc_class_symbol LIKE 'Y02B%'
PATSTAT Support Team
EPO - Vienna
patstat @ epo.org


Valentin
Posts: 3
Joined: Fri Jun 18, 2021 11:52 am

Re: DOCDB Family fractional count by country of inventors

Post by Valentin » Mon Jun 21, 2021 2:19 pm

Hello,

Thank you for your reply.

Sorry, I could have been more specific. For my thesis, I am working on the sources of innovation in the building sector to fight against climate change. My objective is to do some semantic analysis and topics modeling on patents in order to propose intrinsic technological classes. But first, I would like to do a descriptive analysis on the data and here I would like to study the cross-countries flow of knowledge. To do this, I focus on families rather than the applications themselves to avoid double counting inventions. Then, in order to obtain the origin of inventions, I focus on the countries of the inventors rather than on the application authorities (OECD, 2009). For these families with several inventors, I want to perform a fractional count (normalize 1 FR and 2 US = 1/3 FR and 2/3 US) to obtain an exact count by country.

The query of my first post gives a first approximation of this fractional number. As I am really new to Patstat Online and SQL language, my query is not very clean and some inventors are missing. I would like to know if a query, simpler perhaps, existed to carry out such a count. And I would also like to know if there is a way to retrieve the information concerning these missing inventors.

Thank you again,
Valentin


EPO / PATSTAT Support
Posts: 425
Joined: Thu Feb 22, 2007 5:33 pm
Contact:

Re: DOCDB Family fractional count by country of inventors

Post by EPO / PATSTAT Support » Tue Jun 22, 2021 12:25 pm

Hello Valentin,
it will not be easy; and I would say that this is not possible to do in 1 query- 1 go.
You will need to extract the data sets.

In the case of patents that have (at least) 1 FR inventor, there are 8582 patent applications.
They are grouped in 3359 docdb patent families. So for each one of these is families you will need to calculate the fractions.

Patent families can have different levels of completeness on inventor countries, they can have different filing dates and different earliest filing dates. So you will need to build a model that covers it all (or nearly all).

I would do the following (not saying that there might not be better ways):
from the 3359 docdb patent families families, 1059 have only 1 family member --> so no need to look further
further 1731 families have an EP patent in the family --> good source for inventor country codes without extra work.
further 490 families only have 1 inventor --> easy to calculate fraction (is 1...)
and from the remainder 593 patent families, 459 families only have FR inventors --> fraction is 1
leaves 134 families not covered (check/calculate manually ?)

This the query I used to analyse the data:

Code: Select all

SELECT 
distinct  tls201_appln.appln_id,appln_auth, nb_inventors, tls201_appln.docdb_family_id,  
 docdb_family_size ,STRING_AGG(CAST(person_ctry_code AS NVARCHAR(MAX)), ',')  inventors_country
--COUNT(distinct(  tls201_appln.docdb_family_id ))

FROM   tls201_appln  join tls207_pers_appln on tls201_appln.appln_id = tls207_pers_appln.appln_id
join tls206_person on tls207_pers_appln.person_id = tls206_person.person_id
join tls224_appln_cpc on tls201_appln.appln_id = tls224_appln_cpc.appln_id
WHERE
invt_seq_nr > 0
and tls201_appln.appln_id
in (select appln_id from tls207_pers_appln join tls206_person on tls207_pers_appln.person_id = tls206_person.person_id where person_ctry_code = 'FR' and invt_seq_nr > 0)
AND appln_filing_year > 1970
and cpc_class_symbol LIKE 'Y02B%'
and tls201_appln.docdb_family_id in (select docdb_family_id from tls201_appln where docdb_family_size > 1)
and tls201_appln.docdb_family_id  not in (select docdb_family_id from tls201_appln where appln_auth = 'EP')
and tls201_appln.docdb_family_id not in (select docdb_family_id from tls201_appln where nb_inventors = 1)

group by tls201_appln.appln_id, docdb_family_id, appln_auth, nb_inventors,docdb_family_size
having  STRING_AGG(CAST(person_ctry_code AS NVARCHAR(MAX)), ',')  like  '%[^FR,]%' --only French
order by docdb_family_id
Any "replenishment" you whish to do will need to be done before, but I am doubtful whether it would bring much extra for families having a FR inventor. (this would be different for CN).
The rest is for you.
PATSTAT Support Team
EPO - Vienna
patstat @ epo.org


Valentin
Posts: 3
Joined: Fri Jun 18, 2021 11:52 am

Re: DOCDB Family fractional count by country of inventors

Post by Valentin » Wed Jun 23, 2021 9:24 am

Hello,

Thank you very much for the reply, really helpfull !


Best regards,
Valentin


Paolo_TUM
Posts: 1
Joined: Wed Sep 14, 2022 3:00 pm

Re: DOCDB Family fractional count by country of inventors

Post by Paolo_TUM » Wed Sep 14, 2022 3:15 pm

Dear Patstat Support,

I have a similar inquiry as Valentin. I am also currently working on my Master thesis. I am not only looking at one country but at several countries as I will use the synthetic control method later on to compare Sweden with other countries regarding the patents.
I want to get the patent family and corresponding to that what fraction of countries the inventors are from.

I am currently exploring the PATSTAT database and trying to code the correct query for the data I am looking for. However, I am not very experienced in MYSQL and have difficulties retrieving the correct data.

Would it be possible that you help me in retrieving the data? This would be of great value to me.

What I am looking for is the following:

get the columns:
- patent family by INPADOC family not sure if docdb is methodologically that good for me to use. →
ipadoc_family_id
- appln_filing_date = patent priority date; appln_filing_year
- column corresponding to Y02T subclasses
- extract all patents falling into the category Y02T and subcategories of Y02T
- patents with their earliest filing year between 1978 and 2021 and their respective family
- only include families with at least one transnational filing (EPO or PCT)
- get the inventors’ country locations and weigh them within each family

Especially the last bullet point is difficult for me to get.

I tried to execute the code that was posted as the last post in this forum but it does not execute correctly because:
I have access to PATSTAT through my university (Technical University Munich) and I am connecting to their database to get to PATSTAT so I have to use MySQL queries. I believe the code posted was in Postgresql or something similar.
I changed it slightly. There might be a mistake as I am very new to SQL. My version is the following:
Altough it runs without an error I get an empty result (No rows only the column header as an output). Howevever, as I did not. change the specifics in the code I should get the same result, which is at least something, as the person who posted the code.

-- SQLINES LICENSE FOR EVALUATION USE ONLY
CREATE TABLE paolo_oppelt.my_sample_2
SELECT
distinct tls201_appln.appln_id,appln_auth, nb_inventors, tls201_appln.docdb_family_id,
# 1. Convert PostgreSQL to MYSQL then debug MYSQL
# (MAX) is postgresql not mysql
# use longtext instead
# problem with longtext is that CAST() does not support longtext
# use char instead
# maybe use GROUP_CONCAT instead of STRING_AGG
# this is why I have to delete this line:
# docdb_family_size ,GROUP_CONCAT(CAST(person_ctry_code AS char), ',') inventors_country
# and replace it with
# https://database.guide/mysql-group_conc ... tring_agg/

docdb_family_size ,GROUP_CONCAT(CAST(person_ctry_code AS char)) inventors_country
# COUNT(distinct( tls201_appln.docdb_family_id ))

FROM tls201_appln join tls207_pers_appln on tls201_appln.appln_id = tls207_pers_appln.appln_id
join tls206_person on tls207_pers_appln.person_id = tls206_person.person_id
join tls224_appln_cpc on tls201_appln.appln_id = tls224_appln_cpc.appln_id
WHERE
invt_seq_nr > 0
and tls201_appln.appln_id
in (select appln_id from tls207_pers_appln join tls206_person on tls207_pers_appln.person_id = tls206_person.person_id where person_ctry_code = 'FR' and invt_seq_nr > 0)
AND appln_filing_year > 1970
and cpc_class_symbol LIKE 'Y02B%'
and tls201_appln.docdb_family_id in (select docdb_family_id from tls201_appln where docdb_family_size > 1)
and tls201_appln.docdb_family_id not in (select docdb_family_id from tls201_appln where appln_auth = 'EP')
and tls201_appln.docdb_family_id not in (select docdb_family_id from tls201_appln where nb_inventors = 1)

group by tls201_appln.appln_id, docdb_family_id, appln_auth, nb_inventors,docdb_family_size
# # https://database.guide/mysql-group_conc ... tring_agg/
having GROUP_CONCAT(CAST(person_ctry_code AS char)) like '%[^FR,]%' # -- only French
order by docdb_family_id

I read the article https://www.researchgate.net/publicatio ... le_Queries on PATSTAT and they explain how to get the inventor fraction but only for patent application not for the family.

I tried their code but it seems to run for ages (more than 40min) after which I abort it, thus I don't get results.


# These are the columns that are selected as output they are from different input tables
# fractional_count is one of the output columns. It is created as a new column
# t1 and t2 are not tables but subqueries
# t1 reports the count of inventors by country
# t2 reports the total number of inventors by patents
SELECT
person_ctry_code, SUM(tot_in_ctry/tot_in_patent) AS fractional_count
FROM
(SELECT
# I think t my table
t.appln_id,
# The IFNULL() function returns a specified value if the expression is NULL.
ifnull(t1.person_ctry_code, '') AS person_ctry_code,
ifnull(t1.tot_in_ctry, 1) AS tot_in_ctry,
ifnull(t2.tot_in_patent, 1) AS tot_in_patent
FROM
paolo_oppelt.our_sample t
LEFT OUTER JOIN
# --> Accounts for missing inventor references in
# tls207_pers_appln table
(SELECT
a.appln_id,
b.person_ctry_code,
COUNT(b.person_id) AS tot_in_ctry
FROM
tls207_pers_appln a
INNER JOIN tls206_person b ON a.person_id = b.person_id
WHERE
a.invt_seq_nr > 0
GROUP BY a.appln_id, person_ctry_code
# --> Compiles country-level count of inventors per patent
) t1 ON t.appln_id = t1.appln_id
LEFT OUTER JOIN
(SELECT
appln_id, MAX(invt_seq_nr) AS tot_in_patent
FROM
tls207_pers_appln
GROUP BY appln_id HAVING MAX(invt_seq_nr) > 0
# --> Compiles total count of inventors per patent
) t2 ON t.appln_id = t2.appln_id
) our_sample_with_country
GROUP BY person_ctry_code;
# ORDER BY SUM(tot_in_ctry/tot_in_patent) DESC;
CREATE VIEW new_sample AS

It would be great if you could help me with my inquiry as it is quite difficult for me to find the problem in the code for me because I am new to coding with DQL.

Thank you very much.

Kind regards,
Paolo


Post Reply