Citation coverage and company classification

Here you can post your opinions, ask questions and share experiences on the PATSTAT product line. Please always indicate the PATSTAT edition (e.g. 2015 Autumn Edition) and the database (e.g. PATSTAT Online, MySQL, MS SQL Server, ...) you are using.
Post Reply

Aman
Posts: 2
Joined: Fri Mar 10, 2023 7:11 pm

Citation coverage and company classification

Post by Aman » Fri Mar 10, 2023 9:02 pm

Hello,

I have two questions related to the coverage of citation data, and the classification of persons into companies or individuals. I am trying to compute the citations that companies headquartered in different countries make to each other, combining the data in table TLS212_CITATION and TLS206_PERSON using TLS227_PERS_PUBLN as a link between the two.

My first question concerns the coverage of table TLS212_CITATION. I have seen the answer to this thread: forward-citations-and-coverage-9180 and had a look at these two documents: https://documents.epo.org/projects/baby ... 202023.pdf, https://documents.epo.org/projects/baby ... 202301.pdf. My understanding is that these contain the number of citations made and received that I can find in PATSTAT.
However, I am not able to reproduce these numbers at all. What do they represent? Are they unique citations from a document to another? How are the countries in this table assigned?

True, I have the Spring 2020 edition of PATSTAT, but when I run the simplest query:

Code: Select all

SELECT 
		tCIT.PAT_PUBLN_ID,
		tCIT.CITED_PAT_PUBLN_ID
	FROM
		TLS212_CITATION tCIT


I only get a little less than 367 million observations, more than 100 million less than what the table "Overview public citation data in EPO's citation database (REFI)" reported as the number of cited documents.
It would be very helpful to know what is the query to reproduce the table (number of citing and cited by country) so that I can understand what I am doing wrong.

My second question concerns the classification on persons into company or individual. As part of my exercise, I need to understand in which countries companies are incorporated. The way I went about this was using the field PSN_SECTOR in TLS206_PERSON and restrict to observations that contained "COMPANY" in the field. However, I noticed that these numbers are quite small and exhibit substantial breaks over time. For example, running the query

Code: Select all

	SELECT
		COUNT(t3.EARLIEST_PUBLN_YEAR) AS NUM_PER_YEAR,
			t3.EARLIEST_PUBLN_YEAR,
			t2.PERSON_CTRY_CODE
		FROM
			TLS201_APPLN t3
		JOIN
			TLS207_PERS_APPLN t1
				ON t3.APPLN_ID = t1.APPLN_ID
		JOIN
			TLS206_PERSON t2 
				ON t1.PERSON_ID = t2.PERSON_ID
		WHERE t2.PSN_SECTOR LIKE '%COMPANY%'
			AND t2.PSN_SECTOR NOT LIKE '%GOV%'
			AND t2.PSN_SECTOR NOT LIKE '%NON-PROFIT%'
			AND t2.PSN_SECTOR NOT LIKE '%UNIVERSITY%'
		GROUP BY 
			t3.EARLIEST_PUBLN_YEAR,
			t2.PERSON_CTRY_CODE
Looking at the US as the PERSON_CTRY_CODE, I see that the number of publication by persons classified as companies have some huge fluctuations. For example, the number of publication goes from 4395 to 2454 from 1962 to 1963; then from 5163 in 1967 to 9654 in 1968 and then 21141 in 1969, so a four-fold increase in just two years! Similarly from 25k to 60k between 1975 and 1976, etc… I can imagine that these can partly reflect business cycles, but looking at the overall series of publications there do not appear to be such wild fluctuations in patenting.

Can this data be relied upon or do I have to just look at the country of persons, bunching all together? I would like to arrive at actual corporate assignees of the patents, so I would appreciate alternative suggestions to arrive at the same result.

Thanks a lot for your help.


EPO / PATSTAT Support
Posts: 440
Joined: Thu Feb 22, 2007 5:33 pm
Contact:

Re: Citation coverage and company classification

Post by EPO / PATSTAT Support » Tue Mar 14, 2023 7:24 pm

On the differences between the "coverage" document and the data.
"Small" differences can occur; we have replenished in PATSTAT the missing citations, and the PDF document is created on the source data, not the PATSTAT data. We also de-duplicated the NPL in PATSTAT, which would lead to a bit lower figures for NPL.
I have run the below queries to compare with the "AU" data (as a test) and it comes pretty close.

Code: Select all

SELECT publn_auth,
COUNT(distinct(tls211_pat_publn.pat_publn_id )) as "citing_pubs" 
FROM tls211_pat_publn join tls212_citation 
on tls211_pat_publn.pat_publn_id = tls212_citation.pat_publn_id 
and publn_auth in ( 'au')
where tls211_pat_publn.pat_publn_id <> 0
group by publn_auth
order by publn_auth;
SELECT publn_auth,
COUNT(tls212_citation.cited_pat_publn_id) as "cited_pubs" 
  FROM tls211_pat_publn join tls212_citation 
  on tls211_pat_publn.pat_publn_id = tls212_citation.pat_publn_id 
  and publn_auth in ( 'au')
  where cited_pat_publn_id <> 0
group by publn_auth
order by publn_auth;
SELECT publn_auth,
COUNT(distinct(tls212_citation.cited_npl_publn_id )) as "cited_npl" 
  FROM tls211_pat_publn join tls212_citation on tls211_pat_publn.pat_publn_id = tls212_citation.pat_publn_id and publn_auth in ( 'au')
  where cited_npl_publn_id  <> ''
group by publn_auth
order by publn_auth;
The only thing that is a bit strange is that seemingly distinct publications were counted and not the total. But the purpose of that document is rather for comparison between the patent authorities then for checking the exact numbers. Bottom line: citations data in PATSTAT comes from REFI (source data base), and we are 100% sure it's correct in PATSTAT.
The "Country" is the patent authority that published the application. (So called publn_auth in table tls211.) The table does not give any insight on "citing" publications, only cited. (in totals). One could of course look into "who is citing who" based on the patent publication authorities. Patent offices do this sometimes to showcase that their examiners not only look at "English - language" citations, but also cite for example Asian prior art. (for the EPO to speak).

The second query is wrong from a logical point of view. It counts "years". The only purpose of the "sector" is to classify the applicants. It does not say where they are "incorporated". Sometimes large companies have local national representations. You can see that in the data from the query below, AIRBUS OPERATIONS in DE and FR each file patents at the USPTO and EPO.
The assignment of the "sector" to the applicant is always done on the previous PATSTAT release (together with the harmonisation). So it might be that some companies have not been harmonised or "sectorised" yet.

Have a look at this forum topic sql-queries-to-create-simple-applicant- ... able-10588. You can easily adapt it to include the sector, country code, appln_auth, etc...

Code: Select all

SELECT top 1000  psn_name,psn_sector, person_ctry_code,appln_auth,
COUNT(distinct(case when appln_filing_year = 2010 then tls201_appln.appln_id end)) as '2010',
COUNT(distinct(case when  appln_filing_year = 2011 then tls201_appln.appln_id end)) as '2011',
COUNT(distinct(case when  appln_filing_year = 2012 then tls201_appln.appln_id end)) as '2012',
COUNT(distinct(case when  appln_filing_year = 2013 then tls201_appln.appln_id end)) as '2013',
COUNT(distinct(case when  appln_filing_year = 2014 then tls201_appln.appln_id end)) as '2014',
COUNT(distinct(case when  appln_filing_year = 2015 then tls201_appln.appln_id end)) as '2015',
COUNT(distinct(case when  appln_filing_year = 2016 then tls201_appln.appln_id end)) as '2016',
COUNT(distinct(case when  appln_filing_year = 2017 then tls201_appln.appln_id end)) as '2017',
COUNT(distinct(case when  appln_filing_year = 2018 then tls201_appln.appln_id end)) as '2018',
COUNT(distinct(case when  appln_filing_year = 2019 then tls201_appln.appln_id end)) as '2019',
COUNT(distinct(case when  appln_filing_year = 2020 then tls201_appln.appln_id end)) as '2020',
count (tls201_appln.appln_id) total
FROM   tls201_appln  join tls207_pers_appln on tls201_appln.appln_id = tls207_pers_appln.appln_id
join tls206_person on tls207_pers_appln.person_id = tls206_person.person_id
WHERE
appln_auth in ( 'EP','US')
AND appln_filing_year between 2010 and 2020
and applt_seq_nr > 0 and invt_seq_nr = 0
and granted = 'y' --remove this line if you want to have the number of patents filed instead of patents granted 
and psn_sector = 'company'
group by psn_name, psn_sector,person_ctry_code, appln_auth
order by total desc
PATSTAT Support Team
EPO - Vienna
patstat @ epo.org


Aman
Posts: 2
Joined: Fri Mar 10, 2023 7:11 pm

Re: Citation coverage and company classification

Post by Aman » Tue Mar 14, 2023 8:20 pm

Thanks for the helpful answers!


Post Reply