We needed an accounting solution for our dormitory with 320 users and around 900 computers, each 100MBit, and 1 GBit outgoing in total. All traffic is lead through a firewalling & NAT server running Debian GNU/Linux with Kernel 3.2. We are only interested in the total amount of traffic each individual host generates. As the IPs are distributed via dhcp, we know which user has which hosts (with IP and MAC). Users could spoil the system with spoofed MACs, but we trust them (and after all, bandwidth itself is not such a problem). We only want to account traffic to find users that generate much, much more traffic than the others to do some kind of shaping to stay “fair”. And everything should respect as much privacy as possible.
So all in all the requirements were
- Track total traffic at a best effort level
- Account per IP (or MAC)
- Do not track (store) connections
- Fast (as usual)
- Directly write to MySQL
The old solution
The former solution was quite clever but very hard to understand and maintain. Also it mixed firewall functionality with accounting. The basic idea is to use iptables’ ability to account traffic per chain/ target. This is explained in several places. The problem with all these is the overhead of a linear search if you want to distinguish single IPs. Therefor a Perl script was written to generate a tree of chains that leads to a single target per host separately for incoming and outgoing traffic for campus and internet traffic. Also each host rule checks whether IP and MAC match.
This might look like
-A tree0i -d 10.10.0.1/255.255.248.1 -j tree1i -A tree0i -d 10.10.0.2/255.255.248.2 -j tree2i -A tree0i -d 10.10.0.4/255.255.248.4 -j tree4i -A tree0i -d 10.10.0.8/255.255.248.8 -j tree8i [...] -A tree1i -d 10.10.0.2/255.255.248.2 -j tree3i -A tree1i -d 10.10.0.4/255.255.248.4 -j tree5i -A tree1i -d 10.10.0.8/255.255.248.8 -j tree9i [...] -A tree5i -d 10.10.0.8/255.255.248.8 -j tree13i -A tree5i -d 10.10.0.16/255.255.248.16 -j tree21i [...] -A tree5i -d 10.10.0.128/255.255.248.128 -j tree133i -A tree5i -d 10.10.3.5 -s 10.0.0.0/8 -j intrain -A tree5i -d 10.10.3.5 -s 184.108.40.206/16 -j intrain -A tree5i -d 10.10.3.5 -j wanin -A tree5i -d 10.10.2.5 -s 10.0.0.0/8 -j intrain -A tree5i -d 10.10.2.5 -s 220.127.116.11/16 -j intrain -A tree5i -d 10.10.2.5 -j wanin
for incoming traffic and
-A tree20o -s 10.10.2.20 -m mac --mac-source ! \ xx:xx:xx:xx:xx:xx -j REJECT -A tree20o -s 10.10.2.20 -d 10.0.0.0/8 -j intraout -A tree20o -s 10.10.2.20 -d 18.104.22.168/16 -j intraout -A tree20o -s 10.10.2.20 -j wanout
for outgoing. This leads us to a iptables -L -n -v of
... Chain tree97o (1 references) pkts bytes target prot opt in out source destination 7776 613K tree225o all -- * * 10.10.0.128/255.255.248.128 0.0.0.0/0 0 0 REJECT all -- * * 10.10.1.97 0.0.0.0/0 MAC ! xx:xx:xx:xx:xx:xx reject-with icmp-port-unreachable 0 0 intraout all -- * * 10.10.1.97 10.0.0.0/8 0 0 intraout all -- * * 10.10.1.97 22.214.171.124/16 0 0 wanout all -- * * 10.10.1.97 0.0.0.0/0 0 0 REJECT all -- * * 10.10.2.97 0.0.0.0/0 MAC ! xx:xx:xx:xx:xx:xx reject-with icmp-port-unreachable 18 912 intraout all -- * * 10.10.2.97 10.0.0.0/8 7687 718K intraout all -- * * 10.10.2.97 126.96.36.199/16 62749 3669K wanout all -- * * 10.10.2.97 0.0.0.0/0 0 0 REJECT all -- * * 10.10.4.97 0.0.0.0/0 MAC ! xx:xx:xx:xx:xx:xx reject-with icmp-port-unreachable 0 0 intraout all -- * * 10.10.4.97 10.0.0.0/8 0 0 intraout all -- * * 10.10.4.97 188.8.131.52/16 0 0 wanout all -- * * 10.10.4.97 0.0.0.0/0 0 0 REJECT all -- * * 10.10.3.97 0.0.0.0/0 MAC ! xx:xx:xx:xx:xx:xx reject-with icmp-port-unreachable 26 1352 intraout all -- * * 10.10.3.97 10.0.0.0/8 28814 2426K intraout all -- * * 10.10.3.97 184.108.40.206/16 15845 841K wanout all -- * * 10.10.3.97 0.0.0.0/0 ...
where one can now extract the number of bytes. In this case even distinguished between campus and internet traffic (which is no more necessary). The extraction is done every 20 minutes and takes a while and when the firewall has to be restarted. This is always necessary when you add or remove a computer as the tree has to be rebuild. Afterward counters are zeroed with iptables -Z. The number of bytes is written to a MySQL table.
The main problem with this solution is complexity. To understand the Perl code generating the tree you always again need at least half an hour for even most simple changes. Also the generated rule set is at best rudimentary understood by new internet tutors and in no way accessible when looking for problems in the firewall. Also you cannot enable
-A FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT
so every package has to traverse the hole tree (which doesn’t seem to be a problem performance wise).
Ideas for a new solution
The new firewall shall use the great ipset tool together with a mac-ip bitmap (more on this in another post) and separate the accounting from the firewall.
Many possibilities where checked. Most of them somehow mirror ip packages into userspace, extract the header and write it into a database. The 2 main possibilities of mirroring are promiscuous mode+pcap, which is slow as hell and generates a lot of load, and the new netlink socket of recent kernels. One might use the ulog target to mirror only the first x bytes of each packet into userspace. This can also be done in an accumulated way in batches of e.g. 10 packets. The nice thing about this is a fast transfer to userspace. In case of a very high userspace load packets are still routed through the firewall (but not accounted). This is compatible with our best effort requirement. On userspace side usually ulogd listens for these packages. Just recently ulogd version 2 was released but isn’t even in Debian experimental. Nevertheless there are solutions working with ulogd version 1.
I found two examples: ulogd accounting, which again had the problem with a linear iptables ruleset, and ulacc, an ulogd plugin (which is somehow dated and was written for an ulogd version <1). The idea is to extract the MAC, source and destination IP and size from the mirrored packet(-fragments), aggregate traffic in memory per MAC for some time and write it into a text file when being signaled. This could easily be extended with a MySQL functionality. I ported this plugin to a recent ulogd1 and tried it several days. The load was high but doable for our (otherwise pretty underemployed) firewalling host, even at GBit speed. But it kept crashing randomly, leading to an ulogd with 100% cpu usage. As I couldn’t find the reason I put it aside and tried something different.
Another solution is an aggregation per IP right in kernel space with ipt_account. It looks promising but I didn’t want to compile a kernel module for each and every kernel security update.
Conntrack based solution
There already is a kernel service that has to keep track of connections going through the NAT, conntrack. Harald Welte extended this some years ago to also count packets and bytes per connection. This first has to be enabled with
/bin/echo "1" > /proc/sys/net/netfilter/nf_conntrack_acct
shows you what’s happening in your NAT and includes the byte counter. This information again goes through a AF_NETLINK socket and is event based. A program registers for some kind of events. We are interested in DESTROY events, closed connections. One might track these with
conntrack -e destroy
This is great and reduces the amount of traffic between kernel and userspace by several orders of magnitude. The only problem here is some latency as single tcp connections might be open for hours or days (compare /proc/sys/net/*timeout*). One could account for this by also listening for NEW and UPDATE events but this is not necessary in our scenario of a best effort based accounting system.
Backporting ulogd 2
Okay, this all sounds very nice, but how do I use this? Yeah… Like I said, there is ulogd2 which also depends on some new libraries which are not in Debian stable either. But as we already had a backports chroot I took them from testing and build some squeeze packages. (For ulogd2 I found debian build scripts on this site. There is also an ITP page. I had to change some bits, so the debian directory is also on the linked website.)
Using ulogd 2
ulogd2 is a very powerful wrapper around some input, filter and output plugins. It’s documentation is complete but a bit hard to follow. Therefor I will explain our solution in quite some detail. We are interested in the inpflow_NFCT.so flow input and the output_MYSQL.so output. Also because of some MySQL limitations we need filter_IP2BIN.so. I also looked into the inpflow_NFACCT.so Plugin, but this again is not what we are looking for. A stripped down ulogd.conf might look like
[global] logfile="/var/log/ulogd.log" loglevel=1 plugin="/usr/lib/ulogd/ulogd_inpflow_NFCT.so" plugin="/usr/lib/ulogd/ulogd_filter_IP2BIN.so" plugin="/usr/lib/ulogd/ulogd_output_MYSQL.so" # this is a stack for flow-based logging to MySQL stack=ct1:NFCT,ip2bin1:IP2BIN,mysql2:MYSQL [ct1] #netlink_socket_buffer_size=217088 #netlink_socket_buffer_maxsize=1085440 #netlink_resync_timeout=60 # seconds to wait to perform resynchronization #pollinterval=10 # use poll-based logging instead of event-driven event_mask=0x00000004 # only listen to DESTROY events [mysql2] db="FIXME" host="FIXME" user="FIXME" table="FIXME" pass="FIXME" procedure="INSERT_BYTES"
This looks quite easy. One might play with the buffers in [ct1] but I found that this isn’t necessary for a decent performance. The only important setting are the event_mask and the procedure. But as you can see there’s no magic here at all. In a default setup this would write all destroyed connections into the db table which would be a nightmare for privacy advocates.
The corresponding MySQL table
The accounting magic now lies in the layout of the MySQL table and the INSERT_BYTES MySQL function. First the layout. Ulogd2 uses the column names of the table to decide what to fetch from the connection and write into the table (or in our case to send to the MySQL function). For accounting to work we need the orig_saddr (which is the IP of the calling host), the reply_saddr (which is the IP of the answering host) as well as the byte counts for outgoing and incoming traffic. Also you need a connection id column.
All in all our table looks like
DROP TABLE IF EXISTS `ulogd_acc`; CREATE TABLE IF NOT EXISTS `ulogd_acc` ( `_ct_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, `orig_ip_saddr_bin` binary(16) NOT NULL COMMENT 'lokale IP', `reply_ip_saddr_bin` binary(16) DEFAULT NULL COMMENT 'muss vorhanden sein (fuer ulogd)', `orig_raw_pktlen` bigint(20) NOT NULL DEFAULT '0' COMMENT 'hier als outbytes verwendet', `reply_raw_pktlen` bigint(20) NOT NULL DEFAULT '0' COMMENT 'hier als inbytes verwendet', PRIMARY KEY (`_ct_id`), UNIQUE KEY `orig_ip_saddr_bin` (`orig_ip_saddr_bin`) ) ENGINE=MEMORY DEFAULT CHARSET=latin1;
Notice the UNIQUE KEY! This will be important later. Also this is a MEMORY table to avoid a steady stream onto our hard disk. We don’t want use this raw data anyway (see below).
DELIMITER $$ DROP FUNCTION IF EXISTS `BIN_TO_IPV4`$$ CREATE FUNCTION `BIN_TO_IPV4`( _in binary(16) ) RETURNS varchar(64) CHARSET latin1 DETERMINISTIC SQL SECURITY INVOKER COMMENT 'Convert binary ip to printable string' BEGIN -- IPv4 address in IPv6 form IF HEX(SUBSTRING(_in, 1, 12)) = '00000000000000000000FFFF' THEN RETURN CONCAT( ASCII(SUBSTRING(_in, 13, 1)), '.', ASCII(SUBSTRING(_in, 14, 1)), '.', ASCII(SUBSTRING(_in, 15, 1)), '.', ASCII(SUBSTRING(_in, 16, 1)) ); END IF; -- return the full IPv6 form RETURN NULL; END$$ DROP FUNCTION IF EXISTS `BIN_TO_IPV6`$$ CREATE FUNCTION `BIN_TO_IPV6`( _in binary(16) ) RETURNS varchar(64) CHARSET latin1 DETERMINISTIC SQL SECURITY INVOKER COMMENT 'Convert binary ip to printable string' BEGIN -- IPv4 address in IPv6 form IF HEX(SUBSTRING(_in, 1, 12)) = '00000000000000000000FFFF' THEN RETURN CONCAT( '::ffff:', ASCII(SUBSTRING(_in, 13, 1)), '.', ASCII(SUBSTRING(_in, 14, 1)), '.', ASCII(SUBSTRING(_in, 15, 1)), '.', ASCII(SUBSTRING(_in, 16, 1)) ); END IF; -- return the full IPv6 form RETURN LOWER(CONCAT( HEX(SUBSTRING(_in, 1, 2)), ':', HEX(SUBSTRING(_in, 3, 2)), ':', HEX(SUBSTRING(_in, 5, 2)), ':', HEX(SUBSTRING(_in, 7, 2)), ':', HEX(SUBSTRING(_in, 9, 2)), ':', HEX(SUBSTRING(_in, 11, 2)), ':', HEX(SUBSTRING(_in, 13, 2)), ':', HEX(SUBSTRING(_in, 15, 2)) )); END$$ DELIMITER ;
These might be used in a view
CREATE ALGORITHM=UNDEFINED SQL SECURITY INVOKER VIEW `ulogd_acc_ipview` AS select `ulogd_acc`.`_ct_id` AS `id`,`BIN_TO_IPV4`(`ulogd_acc`.`orig_ip_saddr_bin`) AS `ipv4`,`BIN_TO_IPV6`(`ulogd_acc`.`orig_ip_saddr_bin`) AS `ipv6`,`ulogd_acc`.`orig_raw_pktlen` AS `out_bytes`,`ulogd_acc`.`reply_raw_pktlen` AS `in_bytes` from `ulogd_acc`;
And finally the magic part, the INSERT_BYTES function:
DELIMITER $$ DROP FUNCTION IF EXISTS `INSERT_BYTES`$$ CREATE FUNCTION `INSERT_BYTES`( `_orig_ip_saddr` binary(16), `_reply_ip_saddr` binary(16), `_orig_bytes` bigint, `_reply_bytes` bigint ) RETURNS bigint(20) unsigned BEGIN -- convert binary ip string to upper case hex, convert it to decimal base and cast it to int -- now we can AND it with the subnet mask (255.255.248.0) and compare it with our -- subnet (10.10.0.0) -- phew! IF (cast(conv(hex(_orig_ip_saddr),16,10) as unsigned integer) & 0xfffff800)=0x0a0a0000 THEN INSERT INTO ulogd_acc(orig_ip_saddr_bin,orig_raw_pktlen,reply_raw_pktlen) values (_orig_ip_saddr,_orig_bytes,_reply_bytes) ON DUPLICATE KEY UPDATE orig_raw_pktlen=orig_raw_pktlen+_orig_bytes, reply_raw_pktlen=reply_raw_pktlen+_reply_bytes; ELSEIF (cast(conv(hex(_reply_ip_saddr),16,10) as unsigned integer) & 0xfffff800)=0x0a0a0000 THEN INSERT INTO ulogd_acc(orig_ip_saddr_bin,orig_raw_pktlen,reply_raw_pktlen) values (_reply_ip_saddr,_reply_bytes,_orig_bytes) ON DUPLICATE KEY UPDATE orig_raw_pktlen=orig_raw_pktlen+_reply_bytes, reply_raw_pktlen=reply_raw_pktlen+_orig_bytes; END IF; RETURN LAST_INSERT_ID(); END$$ DELIMITER ;
Okay, what am I doing here? First I check whether the connection was initiated by a host inside our network (which is 10.10.0.0/255.255.248.0). If yes, the source IP is inserted into the table together with the in and out bytes. If the IP is already known, we add the bytes. The ON DUPLICATE KEY UPDATE only works because we defined orig_ip_saddr_bin as unique before.
The second rule is only necessary because we have some tunneled ports and DNAT hosts. So if a connection is initiated from somewhere else, but the answer comes from a local host, we now have to use reply_ip_saddr_bin as the local address and swap in for out bytes. The return statement seems to be necessary to make ulogd happy (which I don’t understand after reading its sourcecode).
Let’s look at this strange condition. binary in (My)SQL is a string type, consisting of chars. These first have to be converted into some real hex representation (which more or less only means to convert lower case letters to upper case). This now can be converted from hex to dec and finally casted into an integer. On this integer we can then use subnet logic to see whether the host is from our subnet or not. In this case we can just ignore the ffff of the ipv4 in ipv6 representation as it is swallowed by the logical AND anyway.
Yay. That’s it. The only missing piece is some script to collect the data from the memory table and write back the traffic per month. Afterward one has to zero the counters in the memory table. I do this via deleting the hole table including a reset of the autoincrement as this one is incremented with every destroyed connection. As it’s a bigint this is not really a problem, but nevertheless. You can’t truncate a table when it is locked because of a bug in MySQL. Therefor this is a bit more complicated :-). The main part of the Python script looks like
sql_pull = ''' select `BIN_TO_IPV4`(`ulogd_acc`.`orig_ip_saddr_bin`), `ulogd_acc`.`orig_raw_pktlen`, `ulogd_acc`.`reply_raw_pktlen` FROM ulogd_acc; ''' sql_lock = ''' LOCK TABLE ulogd_acc WRITE; ''' sql_truncate = ''' DELETE FROM ulogd_acc; ''' sql_alter = ''' ALTER TABLE ulogd_acc auto_increment=1; ''' sql_unlock = "UNLOCK TABLES;" sql_account = ''' INSERT INTO accounting(ip,date,monat,jahr,wanin,wanout) VALUES (%s,CURDATE(),MONTH(NOW()),YEAR(NOW()),%s,%s) ON DUPLICATE KEY UPDATE date=CURDATE(), wanin=wanin+%s, wanout=wanout+%s; ''' db_pull = MySQLdb.connect(host=config.RSYS_HOST, user=config.RSYS_USER, passwd=config.RSYS_PASS, db=config.RSYS_NAME, use_unicode=True) db_pull.autocommit(True) cursor=db_pull.cursor() cursor.execute(sql_lock) cursor.execute(sql_pull) computer=cursor.fetchall() cursor.execute(sql_truncate) cursor.execute(sql_alter) cursor.execute(sql_unlock) for row in computer: cursor.execute(sql_account,(row,row,row,row,row)) cursor.close() db_pull.close()
The accounting table looks like
DROP TABLE IF EXISTS `accounting`; CREATE TABLE IF NOT EXISTS `accounting` ( `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, `ip` varchar(20) NOT NULL DEFAULT '', `date` date NOT NULL, `monat` int(11) NOT NULL DEFAULT '0', `jahr` int(11) NOT NULL DEFAULT '0', `wanin` bigint(20) unsigned NOT NULL DEFAULT '0', `wanout` bigint(20) unsigned NOT NULL DEFAULT '0', PRIMARY KEY (`id`), UNIQUE KEY `ipdatetuple` (`ip`,`monat`,`jahr`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 PACK_KEYS=1;
Important is again the unique-statement!!!
That’s all folks
After 3 month of looking around, installing, porting, packaging, trying etc. this seems to be a great solution. The overall number of events is pretty low because conntrack even tracks corresponding ping events. The memory table keeps the load down. Even for thousands of destroyed connections (think of bittorrent…) there is no noticeable cpu or io load. Because of the use of AF_NETLINK accounting is best effort. Netfilter keeps doing, even if userspace would be busy. Accounting is completely separated from the firewall. You don’t need any iptables rule at all. The only contra to this solution is the effort to backport the debian packages in order to get a seamless experience when upgrading to wheezy.