On collecting data

When collecting data, how do you save it?

There are about a million different options. In Python, you can choose from many different libraries:

pickle
numpy
yaml
json
csv
sql

… just to name a few. Over time I’m fairly certain that I’ve managed to save data (whether behavioral or simulation) in all of these formats. This is really inconsistent: it makes it difficult to know what encoding any particular dataset is in, let along the format of the data itself, and it means I end up writing and rewriting code to do saving, loading, parsing, etc., more times than I ought to.

I realize there are libraries like SQLAlchemy, but this is actually a bit too powerful for what I want. I want something simple and easy to remember, and I want it to interface with the tools that I already use; I’m not terribly interested in executing cross-table queries, for example. However, I do want some of the functionality of SQL/SQLite (multiple tables, locking against concurrent access, flexible queries, etc.).

So, I’ve begun writing a small module called dbtools (very creative name, I know) that handles simple interfacing with a SQLite database. Inspired by ipython-sql, my library returns pandas DataFrame objects from SELECT queries, and can handle basic forms of other SQL statements (CREATE, INSERT, UPDATE, DELETE, and DROP).

This is obviously very much a work in progress, and I’m not even sure this will be useful for others! But, I do need a better, more convenient way to save data – especially behavioral data collected online – and so that’s the goal of this library. If you’re interested, you can find the source on GitHub (patches/contributions welcome!).

Here are some examples of how it works.

Create and load

```python Creating a table

from dbtools import Table tbl = Table.create(“data.db”, “People”, … [(‘id’, int), … (‘name’, str), … (‘age’, int), … (‘height’, float)], … primary_key=’id’, … autoincrement=True) tbl People(id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, age INTEGER, height REAL) type(tbl)

If a table already exists, we can just directly create a Table object:

```python Accessing a table
>>> tbl = Table("data.db", "People")
>>> tbl
People(id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, age INTEGER, height REAL)
>>> tbl.columns
(u'id', u'name', u'age', u'height')
>>> tbl.primary_key
u'id'
>>> tbl.autoincrement
True

Insert

Inserting with a list (excluding id, because it autoincrements):

```python Inserting a list

tbl.insert([“Alyssa P. Hacker”, 25, 66.24]) tbl.select() name age height id 1 Alyssa P. Hacker 25 66.24 type(tbl.select())

Inserting with a dictionary:

```python Inserting a dictionary
>>> tbl.insert({
... 'name': 'Ben Bitdiddle',
... 'age': 24,
... 'height': 70.1})
>>> tbl.select()
                name  age  height
id
1   Alyssa P. Hacker   25   66.24
2      Ben Bitdiddle   24   70.10

You can insert as many things as you want as a time – just pass them in as a list of lists and/or dictionaries.

Select

In the previous two examples, I already used an instance of selection with tbl.select(), which is the equivalent of doing FROM People SELECT *. You can use slicing to select rows (but only if the primary key column is an integer and autoincrements). Note that because SQLite databases are one-indexed, indexing the zeroth element returns an empty DataFrame.

```python Selecting rows

tbl[1] name age height id 1 Alyssa P. Hacker 25 66.24 tbl[2:] name age height id 2 Ben Bitdiddle 24 70.1 ```

If you pass in a string or sequence of strings, it will treat them as column names and select those columns:

```python Selecting columns

tbl[‘name’, ‘height’] name height id 1 Alyssa P. Hacker 66.24 2 Ben Bitdiddle 70.10 ```

More advanced selection can be done through the select method by specifying the where keyword argument (and you can use the ? syntax from the sqlite3 library for untrusted inputs):

```python Selection with WHERE

tbl.select(where=’age>24’) name age height id 1 Alyssa P. Hacker 25 66.24 tbl.select(columns=[‘name’, ‘height’], where=(‘age>?’, 24)) name height id 1 Alyssa P. Hacker 66.24 ```

Update

Updating data in the table works by taking a dictionary (with the keys being columns, and values being new data) and (optionally) a where keyword argument like in the select method to specify what data should be updated.

```python Updating

tbl.update({‘age’: 26}, where=’id=1’) tbl.select() name age height id 1 Alyssa P. Hacker 26 66.24 2 Ben Bitdiddle 24 70.10 ```

Delete

Deleting a row or rows requires specifying a where keyword argument like in select and update (if it is not given, all rows are deleted).

```python Deleting a row

tbl.delete(where=’height<70’) tbl.select() name age height id 2 Ben Bitdiddle 24 70.1 ```

Drop

Finally, we can drop a table if we so desire. Of course, this means we can’t interface with it afterwards unless we explicitly create it again:

```python Drop a table

tbl.drop() tbl.select() Traceback (most recent call last): File “", line 1, in File "dbtools/table.py", line 339, in select cur.execute(*cmd) sqlite3.OperationalError: no such table: People ```

Other stuff

Currently there isn’t too much else, except that I’ve included a save_csv method which will call select and then save the result to a csv file (using the DataFrame.to_csv method). I’m sure I’ll come up with other helper methods in the future, and am open to suggestions.