Archive for Stata

Thesis turned in!

It’s been a hectic few weeks, but it felt great to finally hand in my thesis this afternoon. I have posted the abstract of my paper and a graphical walk-through of my findings on a separate page. The paper itself is available here. I’m especially grateful to my thesis advisor, Oded Galor, for so many conversations and comments. I’m also very appreciative of the many friends who discussed my findings with me along the way (particularly Chris and Christine, for their comments this week!).

I’ll be presenting my thesis to the Economics department May 1 (anyone is welcome to attend). I’ll also be making a less technical presentation at Theories in Action on Sunday, April 28.

Submarine Patents from a 21st Century Vantage Point: Honors Thesis Proposal

I am writing an interdisciplinary senior thesis at Brown spanning the fields of computer science and economics. The subject is submarine patents.

A submarine patent is a patent whose prosecution at the US Patent & Trademark Office is purposefully prolonged by the applicant, in the hope of “emerging” some years later with a patent on what has become a fundamental technology and extracting licensing fees from businesses that have already built upon this technology without knowledge of the patent’s filing.

Two reforms made submarine patents much less worthwhile to pursue. Patent terms used to run from the issue date, but starting in June of 1995 all new patent filings would receive terms running from the date of filing. Because the clock was now ticking during prosecution, stalling at this stage became much less desirable. A further reform came in November of 2000, when the USPTO announced that most patent applications would be published to the world 18 months after filing. Given the changes in patent term structure and the lifting of the veil of secrecy surrounding patent applications, the ability of inventors to unexpectedly corner a market long after filing their invention has been effectively eliminated.

Despite the closure of these loopholes, submarine patents continue to issue. In examining all patents that have issued over the past several decades, a single anomaly is prominent: many applications self-sorted to file prior to the closing of the loophole. A tremendous number of patent applications were filed in these days and weeks, and we now see that these were no ordinary applications. In fact, applications filed in these few weeks represent the most pronounced spike in average pendency in modern history. Identifying submarine patents as those filing prior to this discontinuity offers a unique vantage point from which to study the motives for and outcomes of submarine patents.

First, I have downloaded the full text of every patent granted in the past three decades.

I transformed these documents into roughly the following relations (number of tuples in parentheses):

  • Basic bibliographic info — one line for each patent grant (4.8M) (dta sample)
  • Assignee: name, address, etc. (for all assignees of patent) (4.3M) (dta sample)
  • Inventor: name, address, etc. (for all inventors of patent) (11M) (dta sample)
  • References to other patents (in the US and abroad) (55M) (dta sample)
  • References to “non-patent literature” (papers, brochures, etc.) (16M) (dta sample)
  • Parents (1.8M) (dta sample)
  • Fields searched by the examiner for prior art (11M) (dta sample)
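
With a bibliographic table like the first one listed above, the spike in average pendency could be tabulated roughly as follows. This is only a sketch: it assumes the table is already in memory with one row per granted patent, and the variable names filing_date and issue_date (daily Stata dates) are hypothetical, not the names in the actual files.

```stata
* Assumes one row per granted patent, with hypothetical daily date
* variables filing_date and issue_date
gen pendency_days = issue_date - filing_date

* Group filings by calendar week of the filing date
gen filing_week = wofd(filing_date)
format filing_week %tw

* One row per filing week: mean pendency and number of patents filed
collapse (mean) mean_pendency = pendency_days ///
         (count) n_filed = pendency_days, by(filing_week)

* Plot the weekly series to look for a discontinuity
tsset filing_week
tsline mean_pendency
```

Collapsing to weeks rather than days is an arbitrary choice made here to smooth out weekday effects; a daily series would use filing_date directly in by().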

Presentation to Brown’s Economics Honors Thesis Class Nov. 20, 2012 (PDF):

A Python Tutorial for Economists

It’s been a few months since I’ve posted here; blogging was a bit taboo this summer at work (though it turns out I found plenty of other ways to raise red flags for the Cyber Security team using just Python + the Interwebs).

Working in an office with several other research assistants who were proficient with some statistical scripting languages (Stata, SAS), I began to think there’s probably a niche for a more general-purpose language in academic social science research (as well as in automating some of the tasks involved with casework around the office). I was already using Python in much of my work. What started out as a few trips to coworkers’ desks to help them write this or that script quickly turned into a few pages of notes, and that turned into some thirty pages of charts, explanations, and instructional tasks. (I must note, the final formatting was inspired by the style of my linear algebra lecture notes from last semester.)

I presented a version of Python for Economists to some coworkers at the FTC Bureau of Economics in July. I’ve taken three different college classes that taught Python from scratch, but I’ve never seen an approach to teaching Python that I thought was appropriate for students already familiar with scripting languages such as Stata. I focus on two broad applications of Python I’ve found very useful in social science research: web scraping and textual processing (including regular expressions).


  • PDF of the booklet (34 pages, colored Python syntax highlighting)
  • Zipped supporting materials used in the exercises

What is on Google Patents?

I’m a bit disappointed now that I’m finally going through the data I downloaded from Google Patents throughout the semester. It doesn’t seem like it will be very useful for looking at patent trends prior to 2000. It’s unclear what sampling of patent applications they’re actually providing; I wish they were more transparent about what data they’re providing.

Global Interplays of Values, Wealth, and Geography

I wrote a research paper for my course in Geographic Information Systems (GIS). I had a blast writing it, and there was plenty of Stata and ArcGIS play to be done. I’m posting a download link here, in case anyone is as excited about this kind of stuff as I am.


  • It’s true that distance to the equator is a good predictor of wealth: it explains 28% of the variance in per-capita GDP, a highly significant and satisfying result.
  • We can do similar studies with measures of cultural values. It seems that many values also change similarly with geography.
  • Using four fairly arbitrarily selected values as regressors, about two thirds of the variance in per-capita income can be predicted. This was really surprising to me, and it seems like there’s a lot of similar work to be done here regarding mapping and cultural values.
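
For readers who want to try something similar, regressions of this kind might be run in Stata roughly as below. This is a sketch under assumptions, not the code from the paper: it assumes a country-level dataset is in memory, and every variable name (gdp_pc, latitude, value1 through value4) is hypothetical.

```stata
* Hypothetical variable names; assumes one row per country in memory
gen abs_latitude = abs(latitude)      // distance from the equator, in degrees

* Wealth on distance from the equator
regress gdp_pc abs_latitude
di "share of variance explained: " e(r2)

* Wealth on four (hypothetical) cultural-value measures
regress gdp_pc value1 value2 value3 value4
di "share of variance explained: " e(r2)
```

After regress, the R-squared is stored in e(r2), which is the “share of variance predicted” figure quoted in the bullets above.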

How to pronounce Stata

Because it’s come up with so many ECON1620 students I’ve talked to this semester.

From the Stata FAQ:

4. The names Stata and Mata

4.1 What is the correct way to pronounce ‘Stata’?

Stata is an invented word. Some pronounce it with a long a as in day (Stay-ta); some pronounce it with a short a as in flat (Sta-ta); and some pronounce it with a long a as in ah (Stah-ta). The correct English pronunciation must remain a mystery, except that personnel of StataCorp use the first of these. Some other languages have stricter rules on pronunciation that will determine this issue for speakers of those languages. (Mata rhymes with Stata, naturally.)

4.2 What is the correct way to write ‘Stata’?

Stata is an invented word, not an acronym, and should not appear with all letters capitalized: please write “Stata”, not “STATA”. Mata is also an invented word, not an acronym.


When you’re using large datasets, it’s not uncommon for your do-file to take several hours to run. Especially if you run it overnight, you might be curious how long it took. Stata puts timestamps at the opening and closing of log files, but maybe you want to know the time taken by a given command.

I remember one of my first tasks as a research assistant was to extract a few years’ records from a tremendous file for a data request (we later broke the file up into smaller files). Believe it or not, this task was well into the realm of “things so long and resource-intensive that you run them overnight,” and I was curious to know just how much of the time was spent simply loading the file into main memory, even before selecting the relevant years, projecting onto the requested attributes, and writing the output to disk. Here’s a nice .ado file I’ve been using ever since. I think it’s an important command to help users find out which commands are the most efficient for their purposes, and I’m surprised Stata doesn’t include it.

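program define clock
* Time the command passed as arguments, e.g. "clock use bigfile"
di "starting at " c(current_time)
timer clear 1
timer on 1
* `0' holds everything typed after "clock"; run it as a command
`0'
timer off 1
* timer list stores the elapsed seconds in r(t1)
quietly timer list 1
di "finished at " c(current_time)
di "total length was " r(t1) " seconds"
timer clear 1
end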


A fun extension I was considering would be to have it write out to a simple comma-delimited log something like the current timestamp, the command run, and the time taken to execute. For those sharing a server with coworkers like I was, it would be interesting to see how certain commands’ efficiencies compare under different network conditions. For instance, I always wondered what the cost was of having two users load data from disk into memory at the same time, (perhaps) causing non-sequential disk reads with extra seeks. Users who don’t share resources might still be interested in evaluating efficiency while other programs are running.
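
The logging extension described above might look something like the sketch below. The program name clocklog and the log file name clocklog.csv are made up for illustration, and the log is written to whatever the current working directory happens to be.

```stata
* clocklog.ado: hypothetical variant of clock that appends one row per
* run (timestamp, command, seconds) to a comma-delimited log
program define clocklog
    timer clear 1
    timer on 1
    `0'                           // run the command passed to clocklog
    timer off 1
    quietly timer list 1          // elapsed seconds land in r(t1)
    local secs = r(t1)

    * Append to the log; the file is created if it does not exist yet
    tempname fh
    file open `fh' using "clocklog.csv", write append text
    * _char(34) writes a double quote, so commas inside the command
    * (e.g. before options) do not split the field
    file write `fh' "`c(current_date)' `c(current_time)'" ","
    file write `fh' _char(34) `"`0'"' _char(34) "," "`secs'" _n
    file close `fh'
end
```

It would be invoked the same way as clock, for example “clocklog use bigfile, clear” (bigfile being a placeholder dataset name). One known gap in this sketch: a command that itself contains double quotes would still break the CSV quoting.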

The 411 on Home-Brewed Auto-Do Files: init.ado

If you know any Stata commands, then you know enough to make your own auto-do file. This walkthrough will create a simple ado file you can invoke that welcomes you to Stata and navigates to the directory you typically work from.

An ado file is a script that you invoke from Stata’s command line. One example of an ado file is “describe,” one of the first commands Stata learners use to see information about the variables in their dataset. In Stata, type “which describe,” and Stata gives you a directory path. When you invoke “describe,” Stata executes the script at that location (on my Mac, for example, “/Applications/Stata/ado/base/d/describe.ado”). If you pass parameters to an auto-do file, they are stored as local macros `0', `1', `2', etc. Another post about clock.ado uses parameters.
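
To see how the parameter macros behave, here is a tiny illustrative ado file. The name showargs is hypothetical (it is not a built-in Stata command); the point is only that `0' holds everything typed after the command name, while `1', `2', and so on hold the individual space-separated arguments.

```stata
* showargs.ado: hypothetical example, not part of Stata
program define showargs
    di "everything after the command name: `0'"
    di "first argument:  `1'"
    di "second argument: `2'"
end
```

Typing “showargs alpha beta” would put “alpha beta” in `0', “alpha” in `1', and “beta” in `2'.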

If you open the describe.ado script in your favorite text editor, you’ll see something like:

*! version 2.1.0  25feb2010
program describe, rclass

version 9
local version : di "version " string(_caller()) ":"
syntax [anything] [using] [, SImple REPLACE *]

… a bunch of other Stata code, just like a do-file …


Make a document called “init.ado” in any text editor, and save it in the “personal” directory that Stata’s sysdir command lists. It should be inside a directory called “ado.” If the directory “personal” doesn’t exist, make it inside “ado.”

For this fun ado file, we’ll use the “cd” command, which stands for “change directory.” Have Stata navigate to the folder you usually work from by typing “cd <directory path>” (e.g., the path to a course’s folder). From there, you can cd into a subfolder (e.g., the current problem set or project you’re working on), and use Stata’s “doedit” command to open the .do file there.
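
If you are not sure where that directory lives on your machine, the following built-in commands, typed at Stata’s command line, should point you to it (the paths they print differ from one installation to the next):

```stata
sysdir        // lists Stata's system directories, including PERSONAL
personal      // displays the path of the PERSONAL directory
adopath       // shows the order in which Stata searches for ado files
```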

Add to your ado file any other helpful commands you tend to run, and you can even impress your friends by having Stata welcome you each time you type it.

program define init
* Move to the usual working directory and turn off output paging
cd /Users/Alex/Homework/Junior
set more off
di "Hello, Mr. Bell. What may I compute for you today?"
end

A helpful note on testing ado files: Stata seems to cache them (it saves time to store the script once it has been found, rather than look for the ado file every time you invoke it). That means if you write your ado file, test it, then go back to edit and re-save it, you’ll probably have to restart Stata to make it load the new version.
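
A lighter-weight alternative to restarting is Stata’s built-in discard command, which drops automatically loaded programs from memory so the next call re-reads the file from disk. A minimal sketch, using the init program from this post:

```stata
discard       // drop automatically loaded ado programs from memory
init          // Stata now re-reads init.ado from disk
```

If you only want to reload one program and leave everything else alone, “program drop init” does the same thing for just that program.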