Alex Kreidler


Blog

Small Projects

Often I wonder if a project is too small to publish. *This tool will only be useful for me*, I think. But if you open source the project, others might find it useful, and they might request features or fix bugs to make it better for everyone. Even if no one uses it but yourself, a published project accessible at a public domain is way easier to use than having to spin up a local dev server. This also minimizes dependency rot, e.g. getting errors or deprecation warnings when running an old project with a new version of Node.

A few of my projects are like that:

- **[SuperCalc](https://supercalc.alexkreidler.com/)** - A multi-line/expression calculator built on Math.js, inspired by Numi.
- **[OntologySearch](https://ontologysearch.netlify.app/)** - A web search tool for ontologies.
- **[table-annotate](https://table-annotate.alexkreidler.com/)** - A tool to annotate tabular datasets with links to Wikidata entities.

(Though to be honest, I haven't used the bottom two recently as I've done less semantic web stuff.)

This week, I published two more small projects:

- **[Font Playground](https://font-playground.alexkreidler.com/)** - A tool to help you pick fonts by comparing them at visually similar sizes.
- **[DataBox](https://databox.alexkreidler.com/)** - A data toolbox to run DuckDB queries in your browser. (Useful for testing queries on remote Parquet files or Hugging Face datasets, or running geospatial queries.)

Note these are all static single-page apps that don't need a backend server, so you can host them for free, forever, on Netlify or Cloudflare Pages. You can even extend this to most traditional backends that request data from APIs or even some databases (like SQLite) by deploying them to a serverless platform like Cloudflare Workers. Only compute-intensive tasks like heavy analytical queries or ML jobs can't be deployed this way.
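For a sense of the kind of query DataBox is aimed at, here's a minimal sketch using DuckDB's Python client against a remote Parquet file. The URL is a placeholder, and I'm assuming a recent DuckDB that can auto-load the httpfs extension:

```python
import duckdb  # pip install duckdb

# Query a remote Parquet file directly over HTTP; DuckDB reads only the
# byte ranges it needs. The URL below is a placeholder, not a real dataset.
url = "https://example.com/path/to/dataset.parquet"

# Older DuckDB versions may need: duckdb.sql("INSTALL httpfs; LOAD httpfs;")
duckdb.sql(f"SELECT * FROM read_parquet('{url}') LIMIT 10").show()
```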

Nov 1, 2024

2 min read

Cool commands

Here are some commands/snippets I've found useful.

### IP/Connection/TLS info

```bash
curl https://ipinfo.io/json
# Response like:
{
  "ip": "34.77.243.238",
  "hostname": "238.243.77.34.bc.googleusercontent.com",
  "city": "Brussels",
  "region": "Brussels Capital",
  "country": "BE",
  "loc": "50.8505,4.3488",
  "org": "AS396982 Google LLC",
  "postal": "1000",
  "timezone": "Europe/Brussels",
  "readme": "https://ipinfo.io/missingauth"
}

curl https://tls.peet.ws/api/all # For TLS fingerprint data
```

### 1-line installs for Linux/Mac

- Brew: `/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"`
- Rust: `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`
- [Micromamba](https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html): `curl micro.mamba.pm/install.sh | bash`
- [Pixi](https://prefix.dev/docs/pixi/overview#installation): `curl -fsSL https://pixi.sh/install.sh | bash`
- [Tailscale](https://tailscale.com/kb/1031/install-linux): `curl -fsSL https://tailscale.com/install.sh | sh`
- Bun: `curl -fsSL https://bun.sh/install | bash; source ~/.bashrc`
- DuckDB: `wget -qO- https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip | funzip | sudo tee /usr/local/bin/duckdb > /dev/null && sudo chmod +x /usr/local/bin/duckdb`
- Fish: `brew install fish`

### Add Apache-2.0 license to project

```bash
curl -O https://www.apache.org/licenses/LICENSE-2.0.txt
```

(Remember to update the license field in your package manifest file like `package.json` or `Cargo.toml`.)

### Create a SOCKS proxy over SSH

```bash
ssh -NCD 8080 user@remote-host
# -N just forwards the ports, doesn't run a command
# -C compresses data
# -D creates "dynamic" port forwarding by running a SOCKS proxy on your machine tunneling to the remote machine.
```

Then use `ALL_PROXY=socks5h://localhost:8080` (or `HTTP_PROXY`/`HTTPS_PROXY` for tools that accept a SOCKS URL there) to route traffic through your proxy.

### Detect/convert character encodings

[Chardet](https://github.com/chardet/chardet) is a Python package whose `chardetect` CLI is great at detecting encodings, and iconv is a standard GNU Linux tool that converts between encodings.

```bash
chardetect myfile.txt
iconv -f <detected encoding> -t UTF8 myfile.txt > myfile-utf8.txt
```

### Firefox browser history with specific visit datetimes

Run this in DuckDB on a copy of `places.sqlite` from your Firefox profile directory (a randomly-named subdirectory of `~/Library/Application Support/Firefox/Profiles/` on Mac or `%APPDATA%\Mozilla\Firefox\Profiles\` on Windows):

```sql
CREATE OR REPLACE VIEW history AS
SELECT
    title AS page_title,
    make_timestamp(visit_date) AS visit_time,
    url AS webpage_url,
    (CASE visit_type
        WHEN 1 THEN 'Link'
        WHEN 2 THEN 'Typed URL'
        WHEN 3 THEN 'Bookmark'
        WHEN 4 THEN 'Embedded'
        WHEN 5 THEN 'Permanent Redirect'
        WHEN 6 THEN 'Temporary Redirect'
        WHEN 7 THEN 'Download'
        WHEN 8 THEN 'Iframe'
        WHEN 9 THEN 'Reload'
        ELSE 'Other'
    END) AS visit_type,
    visit_count AS visit_count,
    make_timestamp(last_visit_date) AS last_visit_time,
    from_visit AS visit_source
FROM moz_historyvisits, moz_places
WHERE moz_historyvisits.place_id = moz_places.id
ORDER BY moz_historyvisits.visit_date DESC;
```
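And if you'd rather do the detect-and-convert step from the encodings section above in Python instead of shelling out to iconv, a minimal sketch (the file names are placeholders):

```python
import chardet  # pip install chardet

# Detect the encoding of a file, then re-write it as UTF-8.
raw = open("myfile.txt", "rb").read()
guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
print(guess)

text = raw.decode(guess["encoding"])
with open("myfile-utf8.txt", "w", encoding="utf-8") as f:
    f.write(text)
```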

Oct 25, 2024

2 min read

Building a virtual phone: VoIP, SIP, modems, and the AT protocol

Let’s say you want to get a VoIP (Voice over IP) phone that can make and receive calls to regular 10-digit telephone numbers on the Public Switched Telephone Network (PSTN). Why? Maybe because you want advanced conference calling, call recording, or phone menu features, or you just want to be able to call people from your laptop or other devices.

You’ll need the Session Initiation Protocol (SIP), a protocol somewhat similar to HTTP but for voice and video calls over the Internet, to connect your client (software that allows you to dial a number and use the audio input/output from your computer to make the call) to a gateway, which translates messages received via SIP into real calls on the PSTN.

Now how can we do this in the cheapest way possible? There are a few options: purchasing a SIP trunking service (from a company that runs a SIP server that connects through their PSTN gateway, managing it for you), which usually costs around 1 cent per minute. Or you can try some of the open-source software that allows you to create a SIP gateway out of a USB modem. Finally, you could build it yourself by connecting SIP to a cellular modem with a SIM card.

## Option 1 (buying):

You pay for each number you reserve, plus per call minute. I’ve only included prices for outbound calls.

[https://voip.ms/residential/pricing](https://voip.ms/residential/pricing)
- Calling: $0.01/min => $0.60/hour
- Number: $0.40 setup + $0.85/month

[https://sonetel.com/en/prices/international-calls/](https://sonetel.com/en/prices/international-calls/)
- Calling: $0.011/min => $0.66/hour
- Number: $2/month
- They also have a $14/month premium plan that comes with 1 free number reserved. You’d need to spend more than 18 hours calling to make it worthwhile.

[https://www.plivo.com/sip-trunking/pricing/us/](https://www.plivo.com/sip-trunking/pricing/us/)
- Calling: $0.0065/min => $0.39/hour
- Number: $0.5/month

## Option 2 (use SIP software with modem support):

[Chan_dongle](https://github.com/wdoekes/asterisk-chan-dongle) is a plugin for Asterisk that supports Huawei modems. [Chan_dongle_extended](https://github.com/garronej/chan-dongle-extended) is a fork that has worked on fixing some bugs. I haven’t tried either of these personally; Asterisk is a large piece of software that I’ve heard can be complex to configure and get running.

## Option 3 (building):

You buy a modem that supports a SIM card or eSIM, buy the card itself, and write the code to connect SIP to the modem.

Computers communicate with modems via the AT serial protocol, which describes basic commands. Modems often use [other proprietary protocols for data/network connections](https://www.ofmodemsandmen.com/protocol.html), but for setting up voice calling, AT is sufficient. Modems then use the GSM, GPRS, and now LTE standards developed by the 3GPP partnership to communicate with cell towers.

The main problem is that most modems are very poorly documented and have subtly varying edge cases between models. You may need to dig through datasheets like [this](https://forums.ni.com/ni/attachments/ni/170/52200/1/ConnexantAT.PDF) or [this](https://support.usr.com/support/3500/3500-files/3500-at-ig.pdf) or [this one that has a nice walk-through](https://web.archive.org/web/20110714011104/http://www.m2m-platforms.com/data/1vv0300767_UC864-E_Software_User_Guide_Final_DRAFT.pdf).
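To give a flavor of the AT side of this option, here's a minimal sketch using pyserial to dial and hang up a voice call. The serial device path, baud rate, and phone number are placeholders, and real modems differ in which commands and result codes they accept:

```python
import time

import serial  # pip install pyserial

# Placeholders: device path, baud rate, and number depend on your modem/SIM.
PORT = "/dev/ttyUSB2"
NUMBER = "15551234567"

def send(ser, cmd, wait=1.0):
    """Send one AT command and return whatever the modem echoes back."""
    ser.write((cmd + "\r").encode())
    time.sleep(wait)
    return ser.read(ser.in_waiting).decode(errors="replace")

with serial.Serial(PORT, 115200, timeout=1) as ser:
    print(send(ser, "AT"))                   # sanity check, expect "OK"
    print(send(ser, "ATD" + NUMBER + ";"))   # trailing ';' requests a voice (not data) call
    time.sleep(10)                           # stay on the call briefly
    print(send(ser, "ATH"))                  # hang up
```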
If you got an eSIM that supports voice calling (there are few of them) that was very cheap, like $3/7 days, then you might be able to run a SIP gateway for cheaper than the 3rd-party services listed above in Option 1. $3 per 7 days is about $0.43/day ≈ $0.018/hour ≈ $0.0003/min, or roughly 1/22 of Plivo's rate. But that assumes you make calls 24/7, which you probably wouldn't, and it excludes the fixed cost of the modem hardware and server, like a Raspberry Pi. Also, you'd need to set up eSIM profile management, which can be complex -- Ubuntu's ModemManager currently doesn't support that.

There are some cool libraries that could help you implement a SIP gateway, like [pyVoIP](https://pyvoip.readthedocs.io/en/v1.6.5/), though GitHub issues indicate it has pretty bad sound quality and a fair number of bugs. [Viska](https://github.com/Televiska/viska) is a WIP Rust library for SIP. It's not ready for production use yet, but it's cool to see someone working on this.

## SIP and AT protocol example

The SIP protocol can be very complex, supporting multiple users and streaming formats. Here’s an example. Let’s say Alice wants to call Bob.

```
INVITE sip:bob@example.com SIP/2.0
Via: SIP/2.0/UDP alicepc.example.com:5060
Max-Forwards: 70
From: <sip:alice@example.com>;tag=1234
To: <sip:bob@example.com>
Call-ID: 5678
CSeq: 1 INVITE
Contact: <sip:alice@alicepc.example.com>
Content-Type: application/sdp
Content-Length: 147

v=0
o=alice 2890844526 2890844526 IN IP4 alicepc.example.com
s=-
c=IN IP4 alicepc.example.com
t=0 0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
```

The SDP body contains the session description:

- **v**: Protocol version.
- **o**: Originator of the session (Alice), including the session ID, version, network type, and address.
- **s**: Session name or description.
- **c**: Connection data, including the network type and address.
- **t**: Timing information, indicating the start and stop time of the session.
- **m**: Media information, such as audio, specifying the port number, transport protocol, and payload type.
- **a**: Attributes, like the codec used (PCMU) and its clock rate (8000 Hz).

This SIP session initiates a call from Alice to Bob, and the SDP provides details about the media parameters and capabilities for the call.

| SIP Command | Purpose | AT Command | Purpose |
|---|---|---|---|
| INVITE | Initiates a session setup request. | ATD (Dial) | Initiates a call |
| BYE | Terminates a session. | ATH (Hang-up) | Hangs up an active call |

## Conclusion

Overall, I realized that the cost of the modem hardware and SIM card, along with the time and complexity of building a simpler SIP gateway (even with some promising libraries), are too significant. It's much easier just to connect one of the many great SIP clients to a solid service like Plivo so I only pay for the minutes I use. But it's been a fun journey diving all the way from the user experience and networking layer of voice calls down to the hardware, and trying to think about how to improve the system for a cheaper and better calling experience.

Oct 14, 2023

5 min read

Where are public companies incorporated?

All public companies in the US are required to file annual and quarterly reports with the SEC, which usually detail their history, products, acquisitions, financial statements, and legal and regulatory risks. [^1] The SEC has developed an API on top of their EDGAR filing database to give information on the reporting companies and their filings.

[^1]: https://www.sec.gov/education/capitalraising/building-blocks/what-does-it-mean-be-a-public-company

There's a dataset of publicly listed companies: https://www.sec.gov/files/company_tickers.json

And there's another API to get metadata about a company and its latest filings: https://data.sec.gov/submissions/CIK0000320193.json

I downloaded all 10,885 companies and their metadata and extracted the most relevant information into a CSV file, which I [uploaded to GitHub Gists](https://gist.github.com/alexkreidler/1484e8498c12268455a3b3807e7606da).

41% of NYSE and NASDAQ companies are incorporated in Delaware, and 44% of all companies in the list (which includes those exchanges plus CBOE, OTC, and some unlabeled exchanges) are incorporated in Delaware.

Here are some visualizations of the data:

![Company filing categories](/images/blog/sec/company_filing_categories.svg)

![Companies by exchange](/images/blog/sec/company_exchange.svg)

![Companies by entity type](/images/blog/sec/company_entity_type.svg)

There's a large range of industry categories, and some categories like Deep Sea Foreign Transportation of Freight and American Depository Receipts had only [non-operating companies](https://www.sec.gov/info/edgar/edgartaxonomies).

![SIC category by entity type](/images/blog/sec/sicCategory_entity_type.svg)
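If you want to reproduce the Delaware numbers, here's a rough sketch of pulling the two endpoints above with Python. The `stateOfIncorporation` field name and the User-Agent requirement reflect my reading of the EDGAR submissions data; treat the details as assumptions:

```python
import time

import requests  # pip install requests

# The SEC asks automated clients to identify themselves in the User-Agent.
HEADERS = {"User-Agent": "your-name your-email@example.com"}

tickers = requests.get(
    "https://www.sec.gov/files/company_tickers.json", headers=HEADERS
).json()

delaware = total = 0
for entry in list(tickers.values())[:50]:  # sample; drop the slice to fetch all ~10,000
    cik = f"{entry['cik_str']:010d}"       # CIKs are zero-padded to 10 digits
    sub = requests.get(
        f"https://data.sec.gov/submissions/CIK{cik}.json", headers=HEADERS
    ).json()
    total += 1
    if sub.get("stateOfIncorporation") == "DE":
        delaware += 1
    time.sleep(0.12)  # stay well under the SEC's request-rate guidance

print(f"{delaware}/{total} sampled companies incorporated in Delaware")
```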

Oct 1, 2023

1 min read

How to Travel More for Less Money

I love to travel, but it's expensive. For relatively cheap, I've been lucky enough to spend significant time in some of the best cities in the world: Madrid, Barcelona, New York, and San Francisco. I've also visited the mountains of Switzerland, lakes in Northern Italy, canals of Amsterdam, and fields in Ireland. Here are some of my tips for getting the best value from your trips.

First tip: stay for as long as you can in each city. You'll get a better sense of the place, learn your way around, and figure out your favorite spots. You'll also find little hacks and the best-value places. Plus, if it's over a month, you can get cheaper student housing.

## Housing

Use [Airbnb](https://airbnb.com) and [Booking.com](https://Booking.com) for travel in the US, but find the sites that most locals use in other countries, for example [Badi](https://badi.com) or [Idealista](https://idealista.es) in Spain. Also, there are often many temporary/student housing options that are definitely worth it but need a little investigation to find. Try Craigslist in the US. Look for housing that language programs abroad have access to; sometimes it will be cheaper to join a program and get a discount on housing.

## Flights

I use Google Flights to search for flights, and often check whether it's cheaper to book two one-way tickets or to create my own multi-city trip through hub cities. Also consider taking a bus in the US or a train in Europe to your final destination. Whenever you book a flight, make sure you sign up for a frequent flier/rewards program account to earn miles, and that you pay with a card that earns more points on travel purchases.

Consider memorizing some hubs so you can quickly check your own multi-stop itinerary through them:

- United: EWR, IAD, ORD, DEN, IAH, LAX, SFO
- American: DFW, CLT, ORD, PHL, PHX, MIA, LAX, JFK, DCA
- Delta: ATL, SEA, LAX, JFK, LGA, SLC, BOS, DTW (Detroit), MSP (Minneapolis-St. Paul)
- Frontier: MCO, DEN, LAS, PHL, ORD, CLE, RDU
- TAP: LIS, OPO
- Iberia: MAD, BCN
- Most European airlines have flights through LHR, CDG, FRA, AMS.

Some United partners will fly you through a multitude of hubs, like Madrid to Geneva to Milan to Newark. I have not in recent memory bought individual multi-city tickets for US travel, but I have found it useful for travel to Europe.

## Rewards Programs

Many credit cards have rewards programs specifically for travel, earning 2x-5x points on every dollar you spend in various travel categories including flights, hotels, or transit, which can be redeemed via their own portal or by transferring to airline or hotel rewards programs. If you spend in those categories, and are smart about redeeming your rewards, this can deliver more value than simple cashback or other credit-card rewards programs.

When earning points/purchasing flights, use [wheretocredit.com](https://www.wheretocredit.com/) to figure out which airline rewards program to credit your miles to. For example, if you fly on a United flight, you can credit your miles to United's MileagePlus program, or you can credit them to Singapore Airlines' KrisFlyer program, which might be a better deal.

When redeeming points, use [pointsyeah.com](https://www.pointsyeah.com/) to find the cheapest (points-wise) flights on all airlines. This includes transfer bonuses from a credit card rewards program to an airline rewards program.
## Work space

If your hotel or apartment doesn't have (good enough) wifi, there are always Starbucks (though they can be loud) or, in Spain, a local coffee shop on every corner. In major US cities, Capital One Cafes are free to all and have reduced-price drinks if you have one of their cards. In San Francisco, the Expensify Lounge has a great view and nice free snacks; you have to show a lounge pass at the door, so you need to add a payment card and choose a plan (there is a $5/month option). Finally, a friend can bring you into coworking spaces, or in some cities there are coworking events where builders who register get free access for a day.

## Food

If staying for a while, consider leasing a place with a kitchen so you can buy food at a grocery store and cook to save money. Depending on your industry and the city you are in, there may be networking events (or hackathons) with free food. Different grocery chains are cheaper in different cities, so do a quick Google search; a local journalist has most likely compared them. But don't discount the convenience of walking to pick up groceries.

## Conclusion

There are many great experiences you can enjoy with just a little money -- exploring parks and waterfronts, walking through a new neighborhood, or eating in a fancy part of town. Keep the above tips in mind, but don't be afraid to spend some extra money here and there. You are traveling, after all, so have fun!

Aug 6, 2023

5 min read

The TikTok Algorithm

I’ve never personally used TikTok, but I've heard of its legendary sway over people, with some users who scroll for hours getting messages from the app itself "saying they should put the phone down."[@hernHowTikTokAlgorithm2022] It is so powerful, in fact, that projects offering a "TikTok for X" recommendation experience could disrupt other spheres, for example searching/browsing the web, local events, generative AI art/videogames, and text or audio social media.

The most in-depth paper on TikTok’s algorithm is “Analysis on the “Douyin (Tiktok) Mania” Phenomenon Based on Recommendation Algorithms” by Zhengwei Zhao[@zhaoAnalysisDouyinTiktok2021a], who published it while an undergraduate at Sun Yat-sen University in Guangzhou, China. <!-- TODO: cite https://www.linkedin.com/in/zhengwei-zhao-3651981ba/?originalSubdomain=cn --> A paper by authors at ByteDance titled “Monolith: Real Time Recommendation System With Collisionless Embedding Table” focuses more on the infrastructure components of model serving and training at scale.[@liuMonolithRealTime2022a] I recommend reading the papers in that order for a more in-depth view of the summary that follows.

**Video features**: First, TikTok creates lots of ML features about each new piece of video content. It runs NLP algorithms on the description and tags, object detection on the video, and other models on the audio. Then it classifies content into **hierarchical interest groups**, with broad categories like Tech, Sports, or Entertainment containing specific categories like Chinese Football or Bundesliga. Features also include data about the video creator, like their location and more.[@zhaoAnalysisDouyinTiktok2021a; @guoMultimodalRepresentationLearning2019a]

**User features**: Separately, TikTok creates features about the user, based on device details, geolocation, and a **social graph** of other TikTok users built from contacts data. These features also contain detailed data about all the user's interactions, including watch time, sharing, likes, etc., at a very granular level, for example understanding which objects were in a frame while the user was watching a video, and exactly when they scrolled to the next one. Higher-level/computed features are also included, like "same_author_seen" or "same_tag_today", to prevent boredom, for example.[@smithHowTikTokReads2021] These user features are continuously updated, and time is an important component.[@chengICMEGrandChallenge2019]

**Recall and ranking**: Next, TikTok uses **collaborative filtering** based on the hierarchical groups to recommend pieces of content viewed by similar users. For example, if two users are both interested in Japanese rock climbing, and a video is well-received by the first, then it may be shown to the second user. It also filters content in a recall step by genres, topics, and popularity. Then this list is ranked using a formula that combines the output of specialized models, each designed to predict key metrics. Those models are **several terabytes** in size, and are based on DeepFM, or **deep factorization machines**.[@liuMonolithRealTime2022a] FMs are useful because:[@rendleFactorizationMachines2010]

> In contrast to SVMs, FMs model all interactions between variables using factorized parameters. Thus they are able to estimate interactions even in problems with huge sparsity (like recommender systems) where SVMs fail.
> We show that the model equation of FMs can be calculated in linear time and thus FMs can be optimized directly.

[This](https://towardsdatascience.com/factorization-machines-for-item-recommendation-with-implicit-feedback-data-5655a7c749db) is a good introduction to FMs with context on collaborative filtering.[@lundquistFactorizationMachinesItem2020] A large-scale hash table is used to store the feature embeddings so the pairwise/cross-encoder rankings can be computed effectively.[@liuMonolithRealTime2022a]

**Ranking formula**: According to an internal TikTok document acquired by the NYTimes, the formula is roughly as follows:[@smithHowTikTokReads2021]

> P_like x V_like + P_comment x V_comment + E_playtime x V_playtime + P_play x V_play

Where P is the predicted probability of a binary action, E is the predicted time spent, and V is the value or weight assigned to a given action.[@singhThereReallySecret2023] The Wall Street Journal experimentally found that time spent watching a video, the number of repeat viewings, and whether the video was paused during playback were the most important metrics.[@TikTokRecommenderRevealed2021]

**Batch Rollouts based on Interactions**: Pieces of content are not immediately eligible to be retrieved by any user. A video is first “seeded” to just one user, and then to larger batches of users if the majority of users in the previous batch respond positively.[@zhaoAnalysisDouyinTiktok2021a]

**Model drift and in-flight updates**: While users are scrolling, liking, and sharing, TikTok has to incorporate this real-time feedback about users’ changing interests and the quality of content into its models; otherwise they will make suboptimal predictions on key metrics and recommend the wrong content. TikTok transfers a sparse subset of model weights to the parameter servers every hour to keep models up to date, while continuing to serve requests.[@liuMonolithRealTime2022a]
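To recap the ranking step, here's a toy sketch of the leaked formula quoted above. Every weight and prediction is made up for illustration; the real models, features, and values are obviously not public:

```python
# Toy sketch: score = predicted engagement probabilities/times, each
# multiplied by a per-action weight, summed. All numbers are invented.
WEIGHTS = {"like": 1.0, "comment": 2.0, "playtime": 0.1, "play": 0.5}

def rank_score(pred: dict) -> float:
    """pred holds P(like), P(comment), E[playtime in seconds], P(play)."""
    return (
        pred["p_like"] * WEIGHTS["like"]
        + pred["p_comment"] * WEIGHTS["comment"]
        + pred["e_playtime"] * WEIGHTS["playtime"]
        + pred["p_play"] * WEIGHTS["play"]
    )

candidates = {
    "video_a": {"p_like": 0.20, "p_comment": 0.02, "e_playtime": 12.0, "p_play": 0.9},
    "video_b": {"p_like": 0.05, "p_comment": 0.01, "e_playtime": 30.0, "p_play": 0.7},
}
ranked = sorted(candidates, key=lambda v: rank_score(candidates[v]), reverse=True)
print(ranked)  # candidates ordered by predicted value to the platform
```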

Jul 1, 2023

4 min read

Unexpectedly Complex Problems

Most projects I admire are ones where the solution is simple and easy, usually due to the immense effort of the author in understanding and abstracting away the complexity of the underlying problem. Here are some examples of gnarly domains:

- [Peoples' names](https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/)
- [Dates, times, and timezones](https://yourcalendricalfallacyis.com/), some [Hacker News comments showing it's even more complicated](https://news.ycombinator.com/item?id=18031214), and an [XKCD](https://xkcd.com/1061/) to cheer you up
- [Protocol buffers](https://buf.build/blog/protobuf-es-the-protocol-buffers-typescript-javascript-runtime-we-all-deserve/)
- [OAuth](https://www.nango.dev/blog/why-is-oauth-still-hard)
- Package managers: [General problems](https://medium.com/@sdboyer/so-you-want-to-write-a-package-manager-4ae9c17d9527), [SAT solving](https://news.ycombinator.com/item?id=21696363) - [in mamba, the conda replacement](https://github.com/conda/conda-libmamba-solver/blob/main/docs/libmamba-vs-classic.md#retry-logic), [XKCD](https://xkcd.com/1987/)

> “Dependency hell” is a colloquial term denoting the frustration resulting from the inability to install software due to complicated dependencies. From the review we conducted one cannot conclude that the problem is solved. - https://arxiv.org/pdf/2011.07851.pdf

- REST/HATEOAS: https://en.wikipedia.org/wiki/HATEOAS, https://medium.com/@andreasreiser94/why-hateoas-is-useless-and-what-that-means-for-rest-a65194471bc8, https://www.mscharhag.com/api-design/hypermedia-rest, https://www.linkedin.com/pulse/hateoas-question-nikhil-kolekar/, https://levelup.gitconnected.com/love-it-or-hateos-it-3f8d5844e736

Jun 1, 2023

1 min read

About This Site

I've overhauled this website from a simple placeholder to a portfolio site with a blog. This won't be a traditional blog with long or even coherent posts. I'm going to write small posts about any curiosity I come across, whether a new tool or dataset. I'll publish some things in a partial state, and I hope to continuously improve them.

<!-- TODO -->
<!-- I think I'll keep the dates of posts to be when they were originally written or I worked on that project -->

If you have feedback, please email me at alex [at] this domain. Here's a question: Do you prefer inline links or formal citations?

## How it's made

In the past I've used Pandoc, Eleventy, Gatsby, and NextJS to build these kinds of sites. I like Eleventy because it's simple, fast, and has [Collections](https://www.11ty.dev/docs/collections/) built in. But this time I wanted to use React, so I can add interactive stuff later, and because I love Chakra UI. So I filtered [this list of SSGs](https://jamstack.org/generators/) by that constraint, but didn't find anything particularly great. I tried Next but was disappointed with 5-20 second "fast refreshes," so I switched to Vite with [`vike`](https://vike.dev/). I'm pretty happy with it so far.

I got rid of the two client-side components I was using, `react-browser-frame` and `react-goodreads-shelf`, and now everything is [pre-rendered](https://vike.dev/pre-rendering) into plain HTML. I looked into islands-architecture libraries like [iles](https://github.com/ElMassimo/iles) or [Tropical](https://github.com/bensmithett/tropical), and can imagine using them in the future to cut down on bundle sizes. But for now, I'm happy with the simplicity of this site. (Though I've realized the Chakra UI CSS variables I'm not using are a sizeable portion of each HTML page now.)

For the books page, I switched from Goodreads, which deprecated their API, to Literal, a nice sleek app that [has an API](https://literal.club/pages/api) and is actively working on new feature requests.

For the blog page, I'm using a bunch of unified plugins like this:

```js
const pipeline = unified()
  .use(markdown)
  .use(remarkGfm)
  .use(remark2rehype)
  .use(rehypeCitation, {
    bibliography: data.bibliography,
    csl: "https://www.zotero.org/styles/chicago-note-bibliography",
  })
  .use(rehypeSlug)
  .use(rehypeAutolinkHeadings)
  .use(stringify);
```

May 20, 2023

2 min read

My Favorite Software Tools

This is a small list of some of my favorite tech to use and things I think are cool, even if they are unstable but could be useful in the future. You'd also do well looking at my [list of Starred Repos on Github](https://github.com/alexkreidler?tab=stars). In the future, I'll try to add **why** I think this stuff is cool/better than the alternatives.

## Frontend

- React
- Vite
- Typescript
- Chakra UI
- Immer with useImmer, Redux Toolkit or Rematch, Redux

## Backend

- Postgres
- PostgREST - https://postgrest.org/en/stable/
- FastAPI - https://fastapi.tiangolo.com
- SQLModel - unifies Pydantic and SQLAlchemy schemas so you can use the former for FastAPI validation and OpenAPI generation and the latter for quickly building CRUD apps on any SQL database - https://sqlmodel.tiangolo.com

## Cool Stuff

- Fela - atomic CSS compiler - https://fela.js.org/
- Tamagui - Styled-system-like CSS framework with premade components and atomic CSS compiler - https://tamagui.dev/
- PMTiles/Protomaps - can be served as a Mapbox Vector Tiles (MVT) API via a serverless function, or used directly with a client-side JS library that integrates with Maplibre GL JS. https://protomaps.com/
- Vega - a JSON-based visualization grammar, a declarative format for creating, saving, and sharing interactive visualization designs. https://vega.github.io/vega/
- Braid - a synchronization protocol for HTTP, adding versioning and subscriptions - https://braid.org/
- Hydra - a vocabulary for hypermedia-driven Web APIs. https://www.hydra-cg.com/
- Github Stats - https://github-readme-stats.vercel.app/api?username=alexkreidler&count_private=true&show_icons=true
- Buf [connect](https://github.com/bufbuild/connect-es) and [protobuf-es](https://github.com/bufbuild/protobuf-es), a better GRPC API code generator and JS Protobuf implementation[^1] respectively
- GRPC Devtools - faster and more beautiful than the other one - https://github.com/iendeavor/grpc-devtools

[^1]: See the list of issues here: https://buf.build/blog/protobuf-es-the-protocol-buffers-typescript-javascript-runtime-we-all-deserve/

<!-- Add AI stuff, Qdrant DB article -->

## Documentation

- Docusaurus
- TypeDoc

## Note-taking/Research Tools

- Logseq - A non-linear note-taking app that centers around markdown, bullets, backlinks, and block references. Includes programmatic queries to render notes with certain tags - https://logseq.com/
- Foam - A personal knowledge management and sharing system inspired by Roam Research, built on Visual Studio Code and GitHub - https://foambubble.github.io/foam/
- Dendron - A hierarchical note-taking tool that uses markdown files and VSCode - https://www.dendron.so/
- Zotero - A free, easy-to-use tool to help you collect, organize, cite, and share research - https://www.zotero.org/

## Fonts

- Great [list of font lists](https://www.creativelivesinprogress.com/article/free-fonts-typefaces-dazed-studio-jethro-nepomuceno)
- Fontsource - https://fontsource.org/docs/getting-started/introduction
- Fontshare - free fonts for commercial use. https://www.fontshare.com/
- Atipo Foundry - pay-what-you-want fonts for commercial use, with full trial sets. https://www.atipofoundry.com/

May 20, 2023

3 min read

Tips on Writing and Reading

Writing and reading are sneakily intertwined. In the past few months, I've realized writing is a true gem of a tool that can help you improve your thinking -- questioning your assumptions, clarifying thoughts, and helping you learn. Like many of my realizations, this one comes from recent projects: a [literature review](/openie-datasete) of Open Information Extraction datasets, and a few-thousand-word paper/article on Levittown, New York, where I found historical research with 50-100 sources quite fun, and translating that into punchy, concise writing even more so.

I've also been interested in tech related to writing. After writing [a paper](https://alexkreidler.github.io/loqu-paper/) about my project Loqu[^1], I built a proof-of-concept app, [Paperweight](https://github.com/alexkreidler/paperweight), and after some recent school research built another POC, [ScholarWrite](https://github.com/alexkreidler/scholarwrite).

## What should you read?

**Books and longform journalism** are the most rewarding reads. Non-fiction books are highly informative, and the good ones tell a story. Longform articles are great stories and hopefully informative about useful subjects. I've made a [list of my favorite books](/books), and [Longform](https://longform.org/sections) has great articles organized by topic, while [Arts and Letters Daily](https://www.aldaily.com/) is a comprehensive list of magazines and some curated top articles of the week.

Mortimer Adler makes a compelling case for reading **the classics** in _How to Read a Book_, and I heartily agree. They contain the most core knowledge on society -- if only they weren't so hard to read! I believe the 1940 edition is best because it contains his original commentary on the declining state of reading and education in the US, where classics were replaced by synopses and summaries.

## Writing Tips

**Be clear and concise.** As someone steeped in technical skepticism, I used to love loading a sentence with as many qualifiers as possible. Then I wouldn't be wrong, because I'm acknowledging the uncertainty. But I could still be wrong. And people [love a good story, even if it's the less accurate view](/books#Superforecasting). So I'm writing more like how I talk: a little punchier and more direct.

**Good sources are an art.** In law, you prepend a "table of points and authorities" to every motion -- all the cases and statutes you cite/rely on. Much of your case relies on how well those sources support your claims. I feel that doing the extra research to provide the reader with the best possible resource you can find on a topic will leave them with a happy feeling and a desire to read more of your work.

**Add humor.** The best blog posts are funny.

<!-- It's a list of all the cases and statutes you cite in the motion. It's a good way to show you've done your research and to give the judge a quick way to see what you're relying on. I think it's a good idea to do something similar in blog posts. I'm not sure what the best way to do it is, but I'm trying to figure it out. -->
<!-- Don't use a lot of words when you can use a few. Don't use a few words when you can use one. -->

[^1]: The paper is a bit embarrassing to look at now, but I guess that's the case with everything we make that's several years old.

May 20, 2023

3 min read

A Survey of Open Information Extraction Datasets

Open Information Extraction (OIE or OpenIE) is a key task in Natural Language Processing (NLP) that aims to extract structured information from natural language, in tuple format, without relying on pre-specified relations. OIE has various applications such as text summarization, question answering, knowledge graph construction, and more.[@zhouSurveyNeuralOpen2022] OIE datasets play a critical role in advancing research in this area by providing annotated data for training and evaluation purposes.

In this literature review, we focus on OIE datasets and their characteristics. Specifically, we examine five datasets, namely OIE2016, CaRB, WiRe57, BenchIE, and DocOIE. We focus on several subtopics including challenges with evaluation metrics, recent dataset improvements, and advanced applications including coreference resolution and inference. We aim to illuminate the strengths and weaknesses of OIE datasets and how those factors may influence the field, to further the development of more accurate and effective OIE systems.

## Introduction

First, we provide an overview of the task of triple and tuple extraction. A triple consists of a subject, predicate, and object. For example, in the sentence "John works for Google," the subject is "John," the predicate is "works for," and the object is "Google," and this could be written as (John; works for; Google). In OpenIE, the predicate is commonly called a relation, and the subject and object are called arguments, and there may be more than two arguments, for example (John; worked for; Google; in 2011). An n-dimensional extension of a triple is a tuple (the strange nomenclature comes from mathematics). OIE is called **open** information extraction because it uses an open schema, i.e. it does not have predefined relations and uses phrases from the original text.

The first large-scale dataset for Open Information Extraction was the OIE2016 dataset.[@stanovskyCreatingLargeBenchmark2016] It used automatic rules to convert the QA-SRL (Question Answering-Semantic Role Labeling) dataset to tuples, covering WSJ and Wikipedia sources. Later work[@bhardwajCaRBCrowdsourcedBenchmark2019] found many bad annotations in OIE2016, and noted it may be because of the automated conversion. They introduce CaRB, a new OpenIE dataset created by giving 1,282 sentences from OIE2016 and annotation guidelines to crowd-workers on Amazon Mechanical Turk. The authors then annotate 50 sentences and find the matching score between that baseline and the crowd-annotated CaRB data is higher than with the OIE2016 data, likely indicating better quality, which can be verified qualitatively (see Table 1 of Bhardwaj et al.).

The literature has elaborated some key principles of OpenIE datasets: extractions should be asserted by the sentence, informative, and minimal/atomic, while collectively being exhaustive/complete -- covering the information in a sentence.[@stanovskyCreatingLargeBenchmark2016; @lechelleWiRe57FineGrainedBenchmark2019] Others have noted that these preferences may change based on the downstream NLP task.[@gashteovskiBenchIEFrameworkMultiFaceted2022]

## Challenges with Evaluation Metrics

The phrase "you get what you measure" is apt for the field of machine learning, where we build systems to mimic the statistical output distribution to maximize an evaluation metric on a particular dataset. So the metric is almost as important as the dataset itself. Both OIE2016 and CaRB have highly flawed evaluation metrics based on simple token matching.
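As a rough sketch of what this kind of token matching looks like (a simplified reconstruction of the OIE2016-style lexical match described below, not the actual scorer code):

```python
def lexical_match(gold: list[str], pred: list[str], threshold: float = 0.25) -> bool:
    """Count, for every word in the gold tuple, each equal word in the
    predicted tuple; declare a match if the count exceeds a fraction of
    the gold tuple's length. Simplified reconstruction, not the real scorer."""
    hits = sum(1 for g in gold for p in pred if g == p)
    return hits > threshold * len(gold)

# A degenerate "extraction" that just repeats the sentence easily passes:
gold = ["John", "works", "for", "Google"]
pred = "John works for Google John works for Google".split()
print(lexical_match(gold, pred))  # True: hits = 8, i.e. 200% of the gold length
```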
The clearest illustration of the problem was provided by the creators of the WiRe57 dataset[@lechelleWiRe57FineGrainedBenchmark2019], who built a system dubbed Munchkin that simply outputs permutations of the original sentence. The creators of the BenchIE dataset[@gashteovskiBenchIEFrameworkMultiFaceted2022] also created a baseline that sets each verb as a predicate and the preceding and remaining parts of the sentence as the subject and object arguments respectively. Both systems outperform real OpenIE systems when using the CaRB evaluation scorer, and the problem is likely exacerbated on OIE2016.

The OIE2016 lexical match rewards long extractions very similar to the input sentence and doesn't take into account the ordered structure of the tuples. It works as follows: for each word in the gold triple, for each equal word in the evaluation triple, increment a counter. Then return a true match if the counter is greater than 25% of the length of the gold triple. One can see this is easily exploited: by repeating the entire sentence multiple times, a system ensures every word in the gold triple (because they come from the sentence) is counted multiple times -- more often than it occurs in the gold triple -- pushing the counter above 100% of the gold triple's length.

CaRB changes the scoring algorithm to use *multi-match*, allowing a gold tuple to match multiple system extractions, and uses "token" matching at the tuple rather than sentence level. While this better captures the positional/structural information of a tuple, it still encourages long extractions.

## Recent Dataset Improvements

Recognizing the same problem, the BenchIE and WiRe57 authors try different remedies. WiRe57 computes precision and recall scores for each part of the tuple, e.g. subject, predicate, and object, and then sums them and normalizes by total length. BenchIE enumerates all possible representations of a given fact, groups these into a "fact synset," and then computes precision and recall based on the number of matched distinct facts (where at least one representation has an exact match with the system's output).

When a system extracts the same fact multiple times, previous scorers would reward it. BenchIE is neutral to such behavior as long as the repeated extractions are in some synset. In WiRe57, however, this is penalized by the denominator of the precision metric, which measures the length of predicted tuples, and by the fact that a predicted tuple has a 1-1 relation with a reference tuple, enforced by greedy matching.

Due to the robustness of the WiRe57 evaluation metric, it would be interesting to apply it to other datasets such as CaRB, OIE2016, or DocOIE. However, BenchIE does identify cases where its approach is better. For example, in cases where the extraction is factually incorrect/not implied by the sentence but has high lexical overlap with the gold triple, any fuzzy metric including the WiRe57 approach will fail, and similarly if the extraction is correct but has lower lexical overlap. The first case is the most common, while hopefully in the second case the relevant words in the gold triples have been marked as optional (which WiRe57 does but CaRB does not) and would have a high overlap.

Both BenchIE and WiRe57 have demonstrated resulting F1 scores in the range of 0.2-0.35 for state-of-the-art OpenIE systems, much lower than the range of 0.3-0.6 on CaRB.
Since BenchIE is a subset of CaRB sentences with stricter and less flawed evaluation, and because both benchmarks, built from different datasets, produce F1 scores in the same range, this demonstrates that performance on the OpenIE task is significantly inflated by previous benchmarks and there is much room for improvement.

## Advanced applications: Coreference resolution and inference

These two benchmarks may also enable more advanced capabilities. Coreference resolution is the task of finding mentions in a text of a given entity, for example resolving "she" to "Marta." Both benchmarks include optional coreference resolution tuples, which could be used to evaluate future systems that handle that task. WiRe57 also has optional light predicate inference tuples, e.g. inferring (Tokyo; \[is\]; \[a prefecture\]) from "Tokyo ... is the capital city of Japan and one of its 47 prefectures." These benchmarks will likely remain relevant.

DocOIE is the largest "expert-annotated" dataset, containing 800 sentences.[@dongDocOIEDocumentlevelContextAware2021] However, the annotations appear to be of lower quality, although this may be due to the complexity of the source data of patent documents. For example, given the sentence "Alert icon can be selected by the user at the remote station to generate an alert indicator" they annotate (alert icon selected by...; is to generate; an alert indicator), while (alert icon; generate; an alert indicator) would be better. This illustrates a pattern where "is" is prepended to predicates with prepositions so that they don't make sense.

A key differentiator of the DocOIE dataset is the coreference resolution data. In section 3.2 they write, "To gain an accurate interpretation of a sentence, the annotator needs to read a few surrounding sentences or even the entire document for relevant contexts." However, the two previously discussed datasets already include basic sentence-level coreference resolution. The authors of DocOIE could track how many resolutions are in-sentence or within the broader document (and even how far away the source sentence is). This would support their approach of document-based annotation beyond the two qualitative examples provided in their introduction. Finally, the evaluation of common OpenIE systems on their dataset is limited by their use of the CaRB scorer, whose flaws have been outlined above.

## Conclusion

There are many ambiguities and human biases in creating an evaluation dataset, which perhaps reflects the underlying difficulty or entropy of the OpenIE task. These challenges are best understood through the guidelines for annotators. OIE2016 and DocOIE do not describe their annotation policy, CaRB provides a brief overview of its principles, while BenchIE and WiRe57 publish detailed annotation guidelines with examples.

These datasets have varying purposes. CaRB and DocOIE are the largest datasets, with fair quality but less robust scoring methods. BenchIE and WiRe57 are smaller, expert-annotated datasets that are good for more careful evaluation and for developing new OpenIE capabilities. None of the datasets surveyed are of sufficient size to enable training neural-network-based OpenIE systems, which instead train on extractions from previous OpenIE systems, resulting in a performance barrier known as the "bootstrapping problem."[@zhouSurveyNeuralOpen2022] However, these datasets distill the best practices of what an OpenIE system should be, and help us clearly evaluate new approaches and move in the right direction.

Feb 22, 2023

8 min read

Towards Unified ML Dev Tools

A few years ago I started working on turning a radio-controlled toy robotic car into an autonomous vehicle. Thanks to great resources like [Berkeley's CS 188 - Introduction to Artificial Intelligence](https://inst.eecs.berkeley.edu/~cs188/sp23/) and [Sebastian Thrun's Artificial Intelligence for Robotics Udacity course](https://www.udacity.com/course/artificial-intelligence-for-robotics--cs373), I learned about localization and search algorithms and PID controllers.

My idea was to strap a Raspberry Pi on the back, use its camera to run an ML model to do some distance/object detection, and then feed those outputs to a traditional localization algorithm. Then we’d go from there. If I could just get it to figure out where it was or how much it had moved from the "origin", I'd be happy. Keep in mind this is on an ARM CPU, with about 1GB of RAM and a crappy embedded GPU.

At the time, TensorFlow did not distribute any binaries for architectures other than x86_64. So I tried to compile it on the RPi. Oh, what a mistake that was! Burning the poor CPU for 20 minutes and probably not even a few percent done. Then I tried to cross-compile and didn’t get that working.

Anyways, while mired in these practical problems, I had a vision for how I wanted it to work. I'd write a little JSON or YAML file like a package.json that'd look something like this:

```yaml
models:
  - name: obj-distance-detection
    pretrained: yolo
    datasets:
      - s3://some-open-distance-dataset
    hooks:
      - name: localization
        input: boundingbox, distance
        file: localize.py
```

A kind of "package manager for ML." Here I've listed the features that I wanted and the current state of the ecosystem tools that provide them:

* Pretrained weights - [Hugging Face](https://huggingface.co/models)
* Model code - the `transformers` library, random research repos on GitHub
* A better package manager than the [mess of python environments](https://xkcd.com/1987/)
  * Now [mamba](https://github.com/mamba-org/mamba) is super fast and does this pretty well; PDM is interesting too
* Model data - many competitors for a “git for data” type solution, e.g. Pachyderm, DoltHub, LakeFS
* Converting model weights from one format to another: ONNX
* Describe/automatically do AutoML or architecture search: [MLJar](https://github.com/mljar/mljar-supervised) is very easy to use and focuses on tabular supervised learning, [AutoGluon](https://auto.gluon.ai/stable/index.html) is fast
* And finally, wrap this all up in a declarative format like the above. I recently found Ludwig, which does this pretty brilliantly: [https://github.com/ludwig-ai/ludwig](https://github.com/ludwig-ai/ludwig)

The end goal is not just to embed a single ML system in a non-ML system. It is to compose multiple ML systems. But composing ML stuff is hard, because little problems in one part propagate to the whole system and decrease overall accuracy. For example, [Dropbox composed classic computer vision and deep learning methods and described their challenges getting the end-to-end system to work.](https://dropbox.tech/machine-learning/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning)

Recently, researchers have used LLMs as the entrypoint for many other models. One [paper](https://arxiv.org/abs/2303.17580) used ChatGPT to select and run models automatically from HuggingFace to deliver specialized results. These model combinations are especially common for image-language problems. Could we train sub-model components in parallel and compose them properly?
Could we use an ensemble of models to produce training data that could be [cleaned](https://dcai.csail.mit.edu/) or [annotated while doing online training](https://prodi.gy/)? Efficient fine-tuning approaches like LoRA can help us [build models like open source software](https://colinraffel.com/blog/a-call-to-build-models-like-we-build-open-source-software.html), forking, modifying, and even maybe pushing upstream improvements to models. Let's hope open-source can [be a true competitor to centralized ML](https://www.semianalysis.com/p/google-we-have-no-moat-and-neither).

Dec 9, 2022

3 min read

Open-Source Authentication Systems

You want to build a great new app. You know what you’ll use for your frontend, backend, database, for payments and sending emails. But what about authentication?

Authentication is the process of evaluating a credential provided by a user to establish their identity. This could be a username and password, social login, or a hardware key. It's sometimes called authn (to distinguish it from authorization, or authz). The user’s device will store a user ID and possibly other information, so you can know that the user is making certain requests, and provide them access to the data they are authorized to use.

There are lots of commercial SaaS authentication solutions, but in this analysis, we’ll evaluate open source options.

**Keycloak** is what I'd call the gold standard of open source auth. I've read on Hacker News that "you'll never get fired for using Keycloak." It's written in Java as one large application. It has the most features of any of the options here, from compliance with standards like OpenID, SAML, LDAP, etc., to password strength requirements, to an entire authorization policy system as well.

Downsides:

- Docs are not great, especially of the “guide” or “cookbook” type.
- The styling options are through “themes” that use CSS and a Java templating language for HTML. Most people would need to learn specific class names to change styles. This part of Keycloak may be less ergonomic for devs who want to build the login, signup, verification, and reset flows into the frontend themselves with whatever framework they are using.
- Some people have expressed concerns that with a large number of “realms,” Keycloak takes up a lot more resources. For my use-case I don’t think I’ll run into this issue.
- By default, Keycloak uses the PBKDF2 hashing algorithm, although there is an extension to use BCrypt, which many other authn options use. This might matter if you want to migrate to or from Keycloak from another system.
- A lot of examples for Keycloak use the Admin UI, which might be a bit off-putting for devs who want things in config files. However, it has a robust API you can use, and has new features to write configs as static files and import them, and to configure it via Kubernetes CRDs.

They are working on Keycloak.X, its next generation, which will include better performance, cloud-native deployments, a more modular architecture, a headless option, better ways to build the login UI, and more. Keycloak has raving reviews on Hacker News.

**SuperTokens** is a relatively new option. It comprises a Core API service written in Java which uses a Postgres database, and several backend libraries in NodeJS, Go, and Python, which talk to that core API and provide their own “frontend API” (frontend in the sense that it sits in front of other backend services). You add the middleware from those libraries to your HTTP library/framework, and it creates routes like `/auth/signin`. The libs use a pattern of “recipes” for common functionality, configured in an init call to the middleware which instantiates the appropriate classes. You can pass functions to do things like send an email-verification email.

SuperTokens also includes several frontend libraries: a headless JS lib and a functional out-of-the-box collection of web UI components in React, among others. You can override both styles and sub-components of the interface by passing an object as props. They have some great blog posts which helped me understand the importance of good session management for proper security.
Downsides: much newer, with less of a community. Docs are great for examples, not so much for API reference. I ran into some issues using the Python SDK with a FastAPI service.

**AuthN** is a Go service in a single binary that offers common authentication options. Its docs are great in my opinion: the architecture explanation is really nice. Downsides: it has a much more limited feature set. For example, it doesn’t support email verification. While I understand the desire to have a strong boundary and avoid feature creep, at the same time for a new project I want something that “just works” for most of my requirements without having to build it myself if I can help it.

**ORY Stack**: Ory Kratos is an auth system. Docs seemed good, but the system seemed quite complex, with a lot of boilerplate configuration to do something that’s one line with SuperTokens (`providers=[GoogleProvider()]`).
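To make the pattern these tools share more concrete (backend routes like `/auth/signin` plus a session cookie), here's a toy hand-rolled sketch in FastAPI. This is not the API of Keycloak, SuperTokens, AuthN, or Ory; the user store, route names, and cookie handling are made up for illustration and skip real-world concerns like password hashing and CSRF:

```python
import secrets

from fastapi import Cookie, FastAPI, HTTPException, Response
from pydantic import BaseModel

app = FastAPI()

# Toy in-memory stores; a real system uses a database and hashed passwords.
USERS = {"alice@example.com": "correct horse battery staple"}
SESSIONS: dict[str, str] = {}  # session token -> user id

class Credentials(BaseModel):
    email: str
    password: str

@app.post("/auth/signin")
def signin(creds: Credentials, response: Response):
    if USERS.get(creds.email) != creds.password:
        raise HTTPException(status_code=401, detail="invalid credentials")
    token = secrets.token_urlsafe(32)
    SESSIONS[token] = creds.email
    # Session management matters: HttpOnly + Secure cookies, expiry, rotation, etc.
    response.set_cookie("session", token, httponly=True, secure=True)
    return {"user": creds.email}

@app.get("/me")
def me(session: str | None = Cookie(default=None)):
    if session not in SESSIONS:
        raise HTTPException(status_code=401, detail="not signed in")
    return {"user": SESSIONS[session]}
```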

Dec 24, 2021

4 min read

Async Tests in Rust

> Note: this was written when async had just recently [landed in stable Rust](https://blog.rust-lang.org/2019/11/07/Async-await-stable.html) and the library ecosystem for it was still settling down.

TLDR: Use tokio's `#[tokio::test]` macro to write quick and easy tests in Rust using `async` features.

## Backstory

I looked all over the internet and couldn't find a nice way to test those pieces of code that are `async` functions in Rust. BTW, if you haven't read [What Color is Your Function?](https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/), do that now.

## Attempts

So here's what I did.

### 1. futures-await-test

I looked up "async tests rust" and found this crate: https://github.com/ngg/futures-await-test

I copied their `test.rs` file and it worked fine. However, when I tried running it on my code, it failed with a `'not currently running on the Tokio runtime.'` error.

```rust
#[cfg(test)]
mod tests {
    use crate::plugin::*;
    use super::*;
    use crate::plugin::Requestor;
    // use futures;
    use futures_await_test::async_test;

    #[async_test]
    async fn get() -> AResult<()> {
        let mut r = super::Requestor{};
        r.configure(Config{version: "".to_string()});
        // futures::executor::block_on
        let res = r.make_request("https://google.com").await;
        println!("{:#?}", res?);
        Ok(())
    }
}
```

I installed `cargo-expand` and looked at the code that is generated from the macro. I couldn't see anything egregious, so I tried something else.

### 2. futures::executor::block_on

I went lower level and wrote a regular `#[test]` `fn`, but used `futures` and `block_on`. Unfortunately, I got the exact same error.

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use crate::plugin::Requestor;
    use crate::plugin::*;
    use futures;

    #[test]
    fn get() -> AResult<()> {
        let mut r = super::Requestor {};
        r.configure(Config {
            version: "".to_string(),
        })?;
        futures::executor::block_on(async {
            let res = r.make_request("https://google.com").await;
            match res {
                Ok(r) => println!("{:#?}", r),
                Err(e) => panic!(e),
            }
        });
        Ok(())
    }
}
```

### 3. tokio

Maybe this process illustrates my relative foolishness with Rust so far, and my finally understanding the need for an `async` runtime. I guess I just thought `futures` did that automatically. I am curious if I could use a `tokio` runtime but a regular `fn` like above. The sad thing is that the `#[tokio::test]` macro is so poorly documented and hidden that it was the last thing I tried.

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use crate::plugin::Requestor;
    use crate::plugin::*;

    #[tokio::test]
    async fn get() -> AResult<()> {
        let mut r = super::Requestor{};
        r.configure(Config {
            version: "".to_string(),
        })?;
        let url = "https://google.com";
        println!("calling {}", url);
        let res = r.make_request(url).await?;
        println!("{:#?}", res);
        Ok(())
    }
}
```

Anyways, lesson learned, and a few Rust debugging skills were luckily picked up on the way:

- `cargo test -- --nocapture` - prints output from tests
- `cargo test 2>/dev/null` - throws warnings and errors away
- `cargo expand` - view generated code from macros
- Also, stacktraces from inside anything async are a pain!

Jun 24, 2020

3 min read

The Perils of CMSes

Note: This article is about the two prior versions of this blog.

A while back I started my first blog, with a focus on Docker, Golang, and DevOps. I wrote a few articles on my setups for Protobuf-based development in Go. I also wrote a benchmark of Golang filesystem IO using its famed parallelism constructs. The source code for that benchmark is available here: <https://github.com/alexkreidler/golang-parallel-io>

However, I can't find the actual article itself, or any of the other articles for that matter, because I used a CMS but didn't properly save or export my data before deleting the VM it was hosted on. The CMS was [Ghost](https://ghost.org/), a very nice and popular platform. However, it had drawbacks:

- All the data was stored in a mini SQL database, and some in configuration files colocated with the themes/templates
- It had to run a fairly memory-heavy server process just to serve the site

## Markdown for the win

So, gradually and somewhat grudgingly, after holding suspicions against the first generation of static site generators (SSGs) like Flask, I've come to accept that they are **the best** option for any sort of writing[^writing] or blogging.

[^writing]: I actually recently discovered the joy of [Pandoc](https://pandoc.org/), a powerful Haskell-based CLI tool for converting between many document formats. It is used heavily by academics and writers of books. It has also popularized a style of Markdown known as Pandoc markdown that has support for many exciting things, like these footnotes. However, now that I've realized 11ty supports it (through the [markdown-it](https://github.com/markdown-it/markdown-it#syntax-extensions) JS library), I can write in the same format but no longer need an external preprocessing step for my static websites. Pandoc is still useful for generating PDFs and EPUBs.

Let's start here. The benefits are:

- Static: you can deploy to S3, Netlify, GH Pages, virtually anywhere. Most of these places are free because you're not paying for any real server-side processing, just the costs of storage and transfer
- Markdown: you'll never accidentally lose an article again if they're all in plain text. This offers an easy way to move them around: just copy the directory.

And one final recommendation from my lesson learned with writing: store it all in Git. You already use it for code, and writing is just as precious, if not more so, than your code itself.

For me, [Eleventy (a.k.a 11ty)](https://www.11ty.dev/) has been simply delightful. It starts very simple, just transforming the HTML, Markdown, or other source content you have. From there it allows you to define templates in a variety of languages (Nunjucks, Handlebars/Mustache, Liquid, etc). It also has a rich but understandable system for metadata and data in general. In fact, it allows generating pages based solely on data. For this version of the blog, I decided to use [the Hylia Eleventy theme](https://github.com/hankchizljaw/hylia).

## Full circle

A few hours ago, I was perfectly fine with my newfound minimalistic approach to blogging and writing. Then, while looking for Eleventy themes for this blog[^theme], I learned about [Netlify CMS](https://www.netlifycms.org/)[^cms], which follows all of the principles above: Markdown, Git, editable on-disk like any other project. But it wraps all of that up in a nice [admin UI](https://cms-demo.netlify.com), which allows for easy creation, editing, and previewing of articles.
It has a few complexities, like how users are authenticated and how they interact with your Git repository (especially for non-GitHub hosts). For the most part, though, it extends the Git-flow mentality even further, all the way to content management. It seems like it could offer a lot of value for teams, but for myself, to be honest, I feel it may be something that I like in theory but rarely use. Still, I imagine it could be useful when I'm not able to clone a repository but just want to use a web UI.

So, we're back to a CMS. Really? Well, we now know that this one really is quite different. In fact, day to day, as I mentioned above, I barely use the CMS. It's just there as a nice backup. But the answer is yes: I have a blog that works well for me, looks great, loads fast, and has tooling better than anyone could have imagined back in the Jekyll days. And I might even be able to use [React components in my Markdown](https://github.com/mdx-js/mdx) if I set that up! Here comes the future.

[^cms]: I had been using Netlify as a hosting service for a while, so I always assumed Netlify CMS was a paid product/feature for Netlify customers. Much to my surprise, I discovered it is a completely open-source solution that can be used without Netlify, although it does integrate very well with Netlify. I might have picked it up sooner if it had a different name, but otherwise, it is a well-executed project.

Apr 14, 2020

5 min read

Streaming as a Service

It seems like everyone these days is launching their own streaming service. In addition to the familiar Netflix, Hulu, and Amazon Prime Video, we've now got BBC, HBO Max, Disney+, Pluto, Crackle, Freevee, Peacock, Quibi, Plurgi, and Xohowi (OK, you got me, the last two were made up). [Apparently](https://gazette.com/arts-entertainment/all-the-major-streaming-services-from-netflix-to-disney-ranked/) there [are](https://flixed.io/us/en/complete-list-streaming-services) over 200 streaming services, [wow!](https://www.google.com/search?q=so+many+new+streaming+services&ie=utf-8&oe=utf-8)

This trend leads to the fragmentation of content. People don't want to pay for 3 services just to watch the stuff they want. They might end up paying for one of their favorites, but then feel bummed out whenever something they do want to watch is only available elsewhere. Those who feel inclined to be generous to the large studios that own movie copyrights might rent it from a service. Those who don't might [return to the practice of torrenting shows and movies illegally](https://www.vice.com/en_us/article/d3q45v/bittorrent-usage-increases-netflix-streaming-sites).

So how might we solve this?

## The proposal

Streaming-as-a-service would provide one unified platform for publishers to host and users to stream movies and TV, allowing content owners to earn revenue based on their IP. There are multiple ways of splitting revenue: a flat monthly fee distributed as a percentage of the content each user has watched (like Spotify), or various plans that are priced differently and unlock different providers' content (similar to premium cable channels or premium content on Prime Video). I would heavily favor one flat fee with access to content from all providers.

A startup could try this approach, building out better features faster than the in-house teams of all the platforms I listed above and licensing the technology to them. Or it could be a joint venture between some major studios and tech companies. Tech companies have generally shown an openness to collaboration, unlike many other industries, even on technologies close to their core business models, largely due to the influence of open source on software development. However, those collaborations are generally on software components that are not visible to users. I'm not aware of any consumer-facing products/platforms that are collaboratively run/built by tech companies.

## Pros

The benefits to users are pretty obvious:

- All content in one place; less hassle and confusion
- You're paying less?
- There's one organization to yell at when the service goes down/you want new features

The benefits to the companies are maybe less so:

- Cut down on in-house developers - outsource the tech to the service (they might still be paying for devs to maintain it though)
- Stop duplication of work: think about how many things go into streaming
  - CDNs for the actual content
  - High-availability APIs for logins, payments, ratings, recommendations
  - VPN blockers
  - Licensing/distribution rights based on location
  - DRM
  - Support staff
  - DMCA/censorship by various jurisdictions/legal issues
- Maybe companies want to dig in and fight over users, possibly getting into price, promotion, or feature wars with each other. I bet they would rather not
- Third-party/indie filmmakers could also submit content and get paid

## Cons

There are a few issues:

- Could this enable monopolistic behavior by all these content-owning companies?
- Why would a streaming "tech" company want to do this and give up all their hard-earned tech IP? Their competitive advantage goes down the drain.
  - Because this will probably happen at some point without them (e.g. between 3 big movie studios)
  - Because then they can focus on lean content production

However, instead of allowing anyone to upload independent movies, they would probably have a certification process. The big studios don't want an indie movie to be able to go up and make money for free, b/c then what does anyone need a studio for? The studios would just become banks for movies, investing in a project while the production company figures out how to distribute it.

Apr 14, 2020

4 min read

FoundationDB: The Universal Database

If I had to use any one currently available database for every project for the rest of my life[^1], I would pick FoundationDB. Why? This is that story.

## Data Models

In the beginning, there were relational databases. Everything was in a table, with each row a new instance of a given type, and each column a property of that type. To represent the connections between items, there were foreign key columns, which referenced the ID of an item in a different table. Relationships more complex than that required "join tables" or other mapping schemes.

[^1]: assuming there would be complex ones, and I don't mind writing a good chunk of code

Then along came the document-based and graph data models. With the advent of JSON technologies on the web, people decided that it would be nice to store "unstructured" documents of data, with various nested key-value objects. Graph databases presented an appealing option for data that naturally fit that format or was inherently complex, like social, financial, or traffic networks.

Each of these three core data models can be stored in a variety of ways. For example, all of them can be stored in a key-value store. The tabular data model can be stored in either a columnar or row-based format, which just refers to the directionality of how the table is laid out on disk or in memory. [Columnar](https://arrow.apache.org/) structures store data chunked by column first, with every row for a given column colocated. This is generally a better approach for datasets that require heavy statistics. The traditional row-based approach stores each row together, allowing for easy access to a specific row, which is common in business applications, e.g. fetching one user or product.

## The Substructure

The ordered key-value (OKV) store is a powerful, flexible substrate for any kind of data structure. FoundationDB provides performant `Get`, `Set`, and `GetRange` operations with transactional guarantees. This allows the user to design a custom data model for their application in a performant and simple way.

Most developers are fine living within one of the three main data models available to them in a variety of commercial or open-source databases. However, some need a combination of them, so they pick up a multi-model database. The OKV model allows a developer to easily implement almost any data model they could implement in memory with a programming language (there's a toy sketch of this at the end of this post), such as:

- tabular
- matrix/tensor
- graph
- document-based
- linked list
- trees/tries
- set
- stack/queue
- geographic (e.g. hexagonal)

A simple question is: why doesn't everyone do this if it's so much better? One might also ask: why doesn't everyone code in C instead of Python? The answer is: yes, there is more flexibility, but it is also more complex, time-consuming, and difficult to get right. So the next issue is: how do we simplify access to an ordered key-value database so that it offers powerful abstractions that are simple to use, while still preserving access to the metal? That's where another feature of FoundationDB comes in handy.

## Layers

FoundationDB is built on the concept of layers: simple APIs/additional libraries that build on the base FDB API to add functionality. There are a few basic layers built into the default FDB client libraries. They're a great concept and make a lot of things better. However, there are a few problems:
- Layers must be implemented as client-side libraries, which ties each one to the FDB SDK of whichever language it is written in. Thus, layers must be reimplemented in each language to support all the languages the regular FDB SDK supports.
- This also means that clients who want to access the database through a traditional API layer like HTTP/REST or GraphQL need to write yet another server to translate those API calls into FDB client API calls.
- Layers may have undocumented or complex internals, which makes it harder for the programmer to understand the data stored by the layer, or to access it through the regular API.

A solution:

- Allow for Server-Side Layers - abstractions running on a server that provide some of the higher-level APIs
- Publish the internals of the Data Model Layers as a Specification which can be implemented on either the server or client side, and allow direct access to the structure of the data
- Create API Layers, which serve an HTTP, GraphQL, etc. API on top of a Data Model Layer. These are deployed as separate servers and should allow for load balancing to the actual DB instances

<!-- TODO: finish this section -->

<!--
## Other issues

FoundationDB is great in many ways, but also has additional issues beyond those outlined above. Its support for scalability and diverse deployment environments is limited. Unlike Etcd or other key value stores that use a consensus algorithm like Raft, it uses another method. NVM its just very complicated.

## Low level

How is the KV store implemented: B-tree, LSM, etc

TiKV in Rust
-->
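To make the OKV idea from "The Substructure" concrete, here is a toy sketch. This is not FoundationDB's actual client API: a `BTreeMap` stands in for the distributed, transactional store, and the slash-separated key encoding is purely illustrative (FDB's real tuple layer uses a binary encoding).

```rust
use std::collections::BTreeMap;

// A toy ordered key-value store: just enough to mimic Get/Set/GetRange.
// (FoundationDB adds transactions, distribution, and durability on top.)
#[derive(Default)]
struct OkvStore {
    data: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl OkvStore {
    fn set(&mut self, key: &[u8], value: &[u8]) {
        self.data.insert(key.to_vec(), value.to_vec());
    }

    fn get(&self, key: &[u8]) -> Option<&Vec<u8>> {
        self.data.get(key)
    }

    // Yield every (key, value) pair whose key starts with `prefix`, in key order.
    // A real OKV store would seek straight to the prefix instead of scanning.
    fn get_range<'a>(
        &'a self,
        prefix: &'a [u8],
    ) -> impl Iterator<Item = (&'a Vec<u8>, &'a Vec<u8>)> + 'a {
        self.data.iter().filter(move |(k, _)| k.starts_with(prefix))
    }
}

fn main() {
    let mut db = OkvStore::default();

    // A "tabular" layer: users/<row id>/<column> -> value
    db.set(b"users/42/city", b"London");
    db.set(b"users/42/name", b"Ada");
    db.set(b"users/43/name", b"Grace");

    // A "graph" layer in the same keyspace: edge/<from>/<to> -> label
    db.set(b"edge/42/43", b"follows");

    // A range read over the row's prefix means "give me every column of user 42"
    for (k, v) in db.get_range(b"users/42/") {
        println!(
            "{} = {}",
            String::from_utf8_lossy(k),
            String::from_utf8_lossy(v)
        );
    }

    // A point read: Get on a single key
    if let Some(label) = db.get(b"edge/42/43") {
        println!("edge 42 -> 43: {}", String::from_utf8_lossy(label));
    }
}
```

The point is that the "layers" are just conventions about how keys are encoded; ordered keys plus range reads are what let tables, graphs, queues, and the rest fall out of the same primitive.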

Apr 14, 2020

5 min read