Bibliographic Coupling Results Query

Here you can post your opinions, ask questions and share experiences on the PATSTAT product line. Please always indicate the PATSTAT edition (e.g. 2015 Autumn Edition) and the database (e.g. PATSTAT Online, MySQL, MS SQL Server, ...) you are using.
Post Reply

acastor
Posts: 2
Joined: Sat Sep 05, 2015 7:00 pm

Bibliographic Coupling Results Query

Post by acastor » Sat Sep 05, 2015 7:29 pm

I am trying to calculate bibliographic coupling for all 4-digit IPC class pairs for a given year using only citations from and to documents in the same county (e.g., 'DE'). Essentially, this means for each class-class pair, I want to do the following: count the number of citations that patents in one class (e.g., ipc_subclass_symbol='A01B') have in common with patents from another class (e.g., 'A01C').

For the results I am looking for a table with one observation for each 4-digit IPC class pair, including self parings. The three columns would be the count, and each of the 4-digit IPC codes:
Class1 Class2 countvar
A01B A01B 123
A01B A01C 1
A01B A01D 5
...
A01C A01B 1
A01C A01C 123
...

Assuming that there are ~450 4-digit IPC class codes, this should result in a table with a maximum of 450^2 = 202,500 observations. In reality, this will be a very small fraction of that because most pairings will have 0 observations and will not make it into the returned table.

The code that I have created to accomplish this task is:

Code: Select all

SELECT COUNT(*) AS obs, classA.ipc_subclass_symbol, classB.ipc_subclass_symbol 
FROM tls212_citation as citeA
	INNER JOIN tls211_pat_publn as patA ON citeA.pat_publn_id=patA.pat_publn_id
		AND patA.publn_auth = 'DE' 
        	AND DATEPART(year,patA.publn_date)=2010
	INNER JOIN tls209_appln_ipc as classA ON patA.appln_id=classA.appln_id
		AND classA.ipc_position='F'
		AND classA.ipc_subclass_symbol != 'NULL'
	INNER JOIN tls212_citation as citeB ON citeA.cited_pat_publn_id=citeB.cited_pat_publn_id
	INNER JOIN tls211_pat_publn as patB ON citeB.pat_publn_id=patB.pat_publn_id 
        	AND patB.publn_auth = 'DE' 
        	AND DATEPART(year,patB.publn_date)=2010
	INNER JOIN tls209_appln_ipc as classB ON patB.appln_id=classB.appln_id 
		AND classB.ipc_position='F' 
		AND classB.ipc_subclass_symbol != 'NULL'
        	AND classA.ipc_subclass_symbol = classB.ipc_subclass_symbol 
GROUP BY classA.ipc_subclass_symbol, classB.ipc_subclass_symbol
ORDER BY classA.ipc_subclass_symbol, classB.ipc_subclass_symbol

The problem that I am having is that I get errors that the estimated cost of the query is larger than the 1,000,000 limit. Also, I'm guessing the query will take ages.

Can anyone help me tackle this problem with different or more efficient code that I can actually use? I would much rather pull the data this way from the server rather than have to download huge tables (with one obs per citation) and do the calculation on my side.... Any help is needed. I should note that I am somewhat new to SQL.

Thank you in advance!


nico.rasters
Posts: 140
Joined: Wed Jul 08, 2009 5:51 pm
Contact:

Re: Bibliographic Coupling Results Query

Post by nico.rasters » Tue Sep 08, 2015 9:05 pm

I'd do this in php and some intermediate tables, but I assume you are working with PATSTAT Online so that's not an option.

I think you can leave out: "AND classA.ipc_subclass_symbol != 'NULL'" (because the INNER JOIN will drop all patents that have no IPC information) as well as ORDER BY classA.ipc_subclass_symbol, classB.ipc_subclass_symbol (this just slows down your query... you should order the results in Excel).

Note that:
Due to `citn_origin` you may find duplicate citations for any given publication.
Also, using the 4-digit IPC class means that a 4-digit IPC class can occur more than once for any given patent. You should consider how you want to count them.
It is probably better to use family-family citations, but that also makes things more complicated.
________________________________________
Nico Doranov
Data Manager

Daigu Academic Services & Data Stewardship
http://www.daigu.nl/


acastor
Posts: 2
Joined: Sat Sep 05, 2015 7:00 pm

Re: Bibliographic Coupling Results Query

Post by acastor » Tue Sep 08, 2015 11:02 pm

Yes, unfortunately, I am trying to do this in the online version.... Regardless, based on your helpful hints, I've altered the code, though it is still too costly to run:

SELECT COUNT(*) AS obs, classA.ipc_subclass_symbol, classB.ipc_subclass_symbol
FROM tls212_citation as citeA
INNER JOIN tls211_pat_publn as patA ON citeA.pat_publn_id=patA.pat_publn_id
AND patA.publn_auth = 'DE'
AND DATEPART(year,patA.publn_date)=2010
AND citeB.citn_origin='SEA'
INNER JOIN tls209_appln_ipc as classA ON patA.appln_id=classA.appln_id
AND classA.ipc_position='F'
INNER JOIN tls212_citation as citeB ON citeA.cited_pat_publn_id=citeB.cited_pat_publn_id
AND citeB.citn_origin='SEA'
INNER JOIN tls211_pat_publn as patB ON citeB.pat_publn_id=patB.pat_publn_id
AND patB.publn_auth = 'DE'
AND DATEPART(year,patB.publn_date)=2010
INNER JOIN tls209_appln_ipc as classB ON patB.appln_id=classB.appln_id
AND classB.ipc_position='F'
GROUP BY classA.ipc_subclass_symbol, classB.ipc_subclass_symbol


Notes: you were absolutely right about the citation origin. I added code to include only search report citations. This should eliminate the problem of duplicate citations. I also eliminated the null restrictions. I can get rid of them manually myself after download.... Re multiple 4-digit ipc codes, shouldn't selecting the classA.ipc_position='F' including only the 4-digit primary code, so there should only be one. Or am I missing something again?


nico.rasters
Posts: 140
Joined: Wed Jul 08, 2009 5:51 pm
Contact:

Re: Bibliographic Coupling Results Query

Post by nico.rasters » Wed Sep 09, 2015 9:33 pm

I have been trying to rewrite the query to make it run faster but to no avail. I am afraid that your only option is to download data and perform the calculation somewhere else. You can not create views or tables in PATSTAT Online, nor can you create indices. The query you are trying to run is simply too complex. Also, it contains a typo. Line 6 should be AND citeA.citn_origin='SEA' instead of AND citeB.citn_origin='SEA'.

A few comments:
  • What is the theoretical background for using the same publication year and country?
  • If you look at citn_origin=SEA only then you will miss out on some citations
  • If you focus on ipc_position=F you will indeed avoid a double count, but you will miss out on some IPC codes because F is not always assigned as far as I know
And some thoughts...
Basically an invention is represented by a patent family (DOCDB). Each family consists of one or more applications. Each application has one or more publications. Each publication has zero or more backward and forward citations.
From Wikipedia: "Two documents are bibliographically coupled if they both cite one or more documents in common.". We can translate this into: "Two inventions -or patent families- are bibliographically coupled if they both cite one or more inventions in common."
You can tweak this definition to fit your approach of looking at IPC subclasses.

However, not all citations are the same. See tls215_citn_categ. I also would not restrict myself to `ipc_position`="F", and you should keep in mind that your results will vary depending on the level of detail of the IPC class. You are using the subclass instead of the full IPC code. It is very popular to do so, but you should understand the consequences.

I can't promise anything but maybe I can manage to write a php script that calculates your bibliographic coupling. And then I just need to ask one of my buddies to run it on a more recent version of PATSTAT.
Anyway, please think about your approach a bit more first.
________________________________________
Nico Doranov
Data Manager

Daigu Academic Services & Data Stewardship
http://www.daigu.nl/


Geert Boedt
Posts: 176
Joined: Tue Oct 19, 2004 10:36 am
Location: Vienna

Re: Bibliographic Coupling Results Query

Post by Geert Boedt » Wed Sep 23, 2015 9:59 am

There is a bit of a pitfall here:
The tls212_citation table is also used to link citing publications to NPL in the tls214_npl_publn table.
In which case the attribute cited_pat_publn_id will be 0, and will be linked to the 0 record in the tls211_pat_publn_table. I assume the "publn_auth = 'DE' " restriction is only applied "later" in the execution after the tables are joined. So I assume your query creates massive internal tables even though the final table is rather small. I therefore experimented a bit with limiting the source tables via "where's" in the "selects" instead of applying later conditions on the joined tables.

Have a look at the query below which does not trigger the "cost message",
and hopefully gives you the data you are looking for.

Code: Select all

SELECT COUNT(*) AS obs, classA.ipc_subclass_symbol, classB.ipc_subclass_symbol
FROM       (select  * from tls212_citation where citn_origin = 'SEA'  and cited_pat_publn_id <> 0)as citeA
INNER JOIN (select * from tls211_pat_publn where publn_auth = 'DE' and year(publn_date) = 2010) as patA ON citeA.pat_publn_id=patA.pat_publn_id
INNER JOIN (select * from tls209_appln_ipc where ipc_position='F' )as classA ON patA.appln_id=classA.appln_id
INNER JOIN (select * from tls212_citation where citn_origin = 'SEA'   and cited_pat_publn_id <> 0 )as citeB ON citeA.cited_pat_publn_id=citeB.cited_pat_publn_id
INNER JOIN (select * from tls211_pat_publn where publn_auth = 'DE' and year(publn_date) = 2010) as patB ON citeB.pat_publn_id=patB.pat_publn_id
INNER JOIN (select * from tls209_appln_ipc where ipc_position='F' ) as classB ON patB.appln_id=classB.appln_id
GROUP BY classA.ipc_subclass_symbol, classB.ipc_subclass_symbol 
Geert Boedt
PATSTAT customer support
Best regards,

Geert Boedt
PATSTAT support
Business Use of Patent Information
EPO Vienna


Post Reply