Commit Graph

  • 5988147e72 so we do not have to restart after rebuilding the unifiedDict-* files. mwells 2014-04-10 22:01:43 -0700
  • 5fb37ca5d1 a couple little fixes. Matt Wells 2014-04-10 21:59:50 -0700
  • 0362b1e923 Merge branch 'master' into diffbot-testing mwells 2014-04-10 15:09:22 -0700
  • 503cc2e34f inline widget searchbox mwells 2014-04-10 15:08:54 -0700
  • 50a4bf63af use ip/port from hosts.conf Matt Wells 2014-04-10 12:59:15 -0700
  • 0906d0ae38 update gb -h mwells 2014-04-10 00:48:27 -0700
  • bef076917d use -g for debug mode not -d, that's working dir. mwells 2014-04-10 00:36:00 -0700
  • 02304073d4 doc updates. core fixes. mwells 2014-04-10 00:31:41 -0700
  • 6675facc4f removed coll.conf mwells 2014-04-09 20:16:41 -0700
  • 539a1d188e remove coll.main.0/coll.conf mwells 2014-04-09 20:13:49 -0700
  • f55d4d1230 merge diffbot-testing mwells 2014-04-09 20:10:30 -0700
  • 1ea6c597be Merge branch 'diffbot-matt' into diffbot-testing mwells 2014-04-09 20:04:46 -0700
  • 8a003e3492 fix url filters profile logic. mwells 2014-04-09 19:51:36 -0700
  • 2adf5b9bc5 more awesome fixes mwells 2014-04-09 13:31:11 -0700
  • 72dc660598 Merge branch 'testing' into diffbot-matt mwells 2014-04-09 11:18:39 -0700
  • be99155986 more updates mwells 2014-04-09 11:03:31 -0700
  • 9e1199f113 hack about 35%ish done mwells 2014-04-08 19:34:43 -0700
  • 41284bcf4f add diffbot support to admin doc mwells 2014-04-07 14:24:52 -0700
  • b3fcfb1ab0 updated admin.html mwells 2014-04-06 21:19:39 -0700
  • 1b5c6a6278 create hosts.conf into cwd if not there. pretty up logging system. update admin.html mwells 2014-04-06 21:12:52 -0700
  • 5ee79a4c2f daemonize on ./gb 0 etc. mwells 2014-04-06 15:57:38 -0700
  • 9b359aa876 Merge branch 'master' into diffbot-testing Matt Wells 2014-04-06 14:41:03 -0700
  • f2a23f7dd3 Merge branch 'master' into diffbot-testing Matt Wells 2014-04-06 14:39:48 -0700
  • c20c30c53f Merge branch 'testing' of github.com:gigablast/open-source-search-engine into testing mwells 2014-04-06 14:03:13 -0700
  • 23e5a94ddf move log file in the binary itself now. mwells 2014-04-06 14:02:51 -0700
  • 4e6db38517 quick start doc update mwells 2014-04-05 20:33:47 -0700
  • fa7216f978 Merge branch 'testing' mwells 2014-04-05 19:25:35 -0700
  • 5ff88fafbc spider status updates mwells 2014-04-05 18:52:40 -0700
  • 264f27b826 fix url filters to have !insitelist directive mwells 2014-04-05 18:40:39 -0700
  • b0dbf833a7 fix sitelist update logic. mwells 2014-04-05 18:26:00 -0700
  • ac5cf7971b more misc updates. mwells 2014-04-05 18:09:04 -0700
  • bd82145626 Merge branch 'diffbot-testing' into testing mwells 2014-04-05 12:34:46 -0700
  • 89f5c8c059 Merge branch 'diffbot-matt' into diffbot-testing mwells 2014-04-05 11:34:27 -0700
  • 61b4ec4ca6 added some qa testing logic. qa.cpp. mwells 2014-04-05 11:33:42 -0700
  • 0988a134d0 Merge remote-tracking branch 'origin/diffbot' into diffbot-dan Daniel Steinberg 2014-04-01 19:48:24 -0700
  • 4856cc4c60 ||, not && Daniel Steinberg 2014-04-01 10:45:54 -0700
  • 3e38bd169e and return an error Daniel Steinberg 2014-04-01 10:43:17 -0700
  • 94b169b8dc only delete if there were no io errors Daniel Steinberg 2014-04-01 10:42:12 -0700
  • 6568858e81 implement something that works like mv, which tries rename first, and if that fails copies the bytes. rename doesn't work across devices Daniel Steinberg 2014-03-31 20:44:39 -0700
  • d6434191d1 nomenclature changes to reduce collisions. name collection 'qatest123' for doing smoke tests, not 'test'. Matt Wells 2014-03-31 15:02:17 -0700
  • 9c8410767d fix critical title alloc/free bug in title.cpp. Matt Wells 2014-03-28 08:01:01 -0700
  • c1671015c8 Merge branch 'diffbot-dan' into diffbot-testing Matt Wells 2014-03-27 12:19:50 -0700
  • 582349334f do not use certain other json fields when computing checksum for deduping. like stats, querystring, ... Matt Wells 2014-03-27 12:20:53 -0700
  • 402377d2e6 fix bug of gbmin, gbmax etc. not working. floats were being rounded down to ints in most cases it seems. so .9 -> 0 etc. Matt Wells 2014-03-26 11:56:06 -0700
  • d67f09feeb also include a timestamp field with an RFC 1123 formatted date Daniel Steinberg 2014-03-25 21:45:21 -0700
  • 0efac8c156 Defect #2080: seed URLs duplicated Daniel Steinberg 2014-03-25 17:25:55 -0700
  • e1b1b15a38 bigger buffer Daniel Steinberg 2014-03-25 16:34:40 -0700
  • 9846061dff when restarting a bulk job, copy bulkurls.txt to /tmp, and then transfer it back to the new collection folder Daniel Steinberg 2014-03-25 16:20:24 -0700
  • ab90c06d8d add TODO for regex checking Daniel Steinberg 2014-03-25 13:05:43 -0700
  • 1ff6c1fae0 Merge remote-tracking branch 'origin/diffbot' into diffbot-dan Daniel Steinberg 2014-03-25 12:53:37 -0700
  • b8836745f0 use SpiderRequest instead of isonsamedomain flag to determine whether to output data in CSV (Defect #2122) Daniel Steinberg 2014-03-25 12:51:08 -0700
  • b6e5424e32 do not download bulkjob urls in crawlbot. just return a fake http reply. however, do use crawl-delay throttling logic. deduping is already turned off for bulk jobs so it should be ok. mwells 2014-03-21 12:40:38 -0700
  • 502752aba4 doc updates mwells 2014-03-21 08:59:13 -0700
  • b33121af7d make all field names lower case without spaces when we hash them to make the prefixhash. since json names often have mixed case field names and spaces. Matt Wells 2014-03-20 16:08:02 -0700
  • 98a10d4936 Merge branch 'testing' into diffbot-testing Matt Wells 2014-03-20 15:50:49 -0700
  • bbc8fc0c79 always show admin link Matt Wells 2014-03-20 15:48:51 -0700
  • 67202f3731 Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-03-20 15:39:03 -0700
  • 99bd9319fd temp hack to reduce network comm between trinity and neo Matt Wells 2014-03-20 15:42:34 -0700
  • 5ed19026d9 temp debug comments Matt Wells 2014-03-20 15:33:37 -0700
  • b8d0e95035 Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-03-20 10:26:55 -0700
  • ca0843aa8b more bool query fixes. mwells 2014-03-20 10:03:25 -0700
  • cfbec626e8 more righteous fixes for bool queries mwells 2014-03-19 13:51:32 -0700
  • ab3368b5a0 more bool fixes. not operator support. mwells 2014-03-19 09:38:45 -0700
  • 1bb91149d6 more bool fixes mwells 2014-03-18 14:42:50 -0700
  • 652892dc10 more bool fixes mwells 2014-03-18 14:37:59 -0700
  • f392826b1e nested bool query fixes mwells 2014-03-18 14:08:59 -0700
  • b7d80fd02d more bool query fixes mwells 2014-03-18 13:41:36 -0700
  • b31eaee9fd simple bool queries work mwells 2014-03-18 12:07:29 -0700
  • d4302e3301 fix core Matt Wells 2014-03-18 11:12:50 -0700
  • 3b97682cc3 more bool query fixes Matt Wells 2014-03-18 10:44:56 -0700
  • 6e23d37e47 Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-03-17 17:27:28 -0700
  • 54cc8088fb more bool query fixes. hopefully this will do it, but still can do some optimizations for speed. mwells 2014-03-17 17:00:08 -0700
  • 9d3c35ad17 nothing Matt Wells 2014-03-17 13:53:19 -0700
  • 4abf56a75d cleanups Matt Wells 2014-03-16 18:06:22 -0700
  • d2511d0bef host table cleanups Matt Wells 2014-03-16 17:14:47 -0700
  • 5057fdaf14 aesthetic cleanups Matt Wells 2014-03-16 17:12:04 -0700
  • d320bf9d75 spidering back on in main's coll.conf Matt Wells 2014-03-16 15:06:39 -0700
  • c513ad9418 Merge branch 'diffbot' into testing Matt Wells 2014-03-16 14:51:22 -0700
  • acd05aa740 fix a few minor bugs. /master/->/admin/ and crawl type mismatch. Matt Wells 2014-03-16 10:34:58 -0700
  • edbd61b0c5 thread fixes. if pthread_create fails then keep thread queue and just return. will try to relaunch later. do not count delete keys towards shard rebalance count. Matt Wells 2014-03-15 20:07:02 -0700
  • 5ca411e3e2 tuning the rebalance loop Matt Wells 2014-03-15 14:56:11 -0700
  • 86147fe22c tight merge during rebalance to save disk space, so neg recs annihilate pos recs. Matt Wells 2014-03-14 23:37:30 -0700
  • 6c704f6fdf Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-03-14 22:16:40 -0700
  • e37eebd76f when rebalancing wait for merge to complete before scanning more Matt Wells 2014-03-14 22:16:25 -0700
  • 82ac3fab6c merge fixes Matt Wells 2014-03-14 22:15:08 -0700
  • df46a6fc1d Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot-matt Matt Wells 2014-03-14 19:32:10 -0700
  • 1f162ce7b2 update localhosts.conf too Matt Wells 2014-03-14 19:20:23 -0700
  • 553aefdb55 keep files tightly merged when rebalancing to avoid running out of disk space Matt Wells 2014-03-14 19:19:41 -0700
  • cb483c42ea more fixes for bool searching before using a slightly different and simpler approach mwells 2014-03-13 16:00:23 -0700
  • 7812f5c746 more bool fixes. still needs a little more work mwells 2014-03-13 13:54:23 -0700
  • 3b2d981dff more fixes for new boolean logic. mwells 2014-03-13 13:09:33 -0700
  • fb0123ad53 nothing Matt Wells 2014-03-13 11:27:28 -0700
  • 9acb7ef0f4 fix core &token= core Matt Wells 2014-03-13 07:57:06 -0700
  • 7b5816f194 updated error message Daniel Steinberg 2014-03-12 20:56:27 -0700
  • 018258bcaa Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-03-12 20:55:21 -0700
  • fbd1bcd349 initial attempt at new boolean query logic. supports unlimited # of boolean query terms. already docid phased from phasing logic already there but could be phased more to save more mem and speed up a little more. Matt Wells 2014-03-12 20:53:44 -0700
  • 3e7243c6ce fix add url core Matt Wells 2014-03-12 08:28:42 -0700
  • 34f7540160 fix addurl core Matt Wells 2014-03-12 08:11:48 -0700
  • 7ec1513d41 updates Matt Wells 2014-03-12 08:09:45 -0700
  • f27d549fc6 Defect #2122: If a crawl and there are no urlCrawlPattern or urlCrawlRegEx values, only return URLs from that domain Daniel Steinberg 2014-03-11 19:46:38 -0700