Commit Graph

  • 264f27b826 fix url filters to have !insitelist directive mwells 2014-04-05 18:40:39 -07:00
  • b0dbf833a7 fix sitelist update logic. mwells 2014-04-05 18:26:00 -07:00
  • ac5cf7971b more misc updates. mwells 2014-04-05 18:09:04 -07:00
  • bd82145626 Merge branch 'diffbot-testing' into testing mwells 2014-04-05 12:34:46 -07:00
  • 89f5c8c059 Merge branch 'diffbot-matt' into diffbot-testing mwells 2014-04-05 11:34:27 -07:00
  • 61b4ec4ca6 added some qa testing logic. qa.cpp. mwells 2014-04-05 11:33:42 -07:00
  • 0988a134d0 Merge remote-tracking branch 'origin/diffbot' into diffbot-dan Daniel Steinberg 2014-04-01 19:48:24 -07:00
  • 4856cc4c60 ||, not && Daniel Steinberg 2014-04-01 10:45:54 -07:00
  • 3e38bd169e and return an error Daniel Steinberg 2014-04-01 10:43:17 -07:00
  • 94b169b8dc only delete if there were no io errors Daniel Steinberg 2014-04-01 10:42:12 -07:00
  • 6568858e81 implement something that works like mv, which tries rename first, and if that fails copies the bytes. rename doesn't work across devices Daniel Steinberg 2014-03-31 20:44:39 -07:00
  • d6434191d1 nomenclature changes to reduce collisions. name collection 'qatest123' for doing smoke tests, not 'test'. Matt Wells 2014-03-31 15:02:17 -07:00
  • 9c8410767d fix critical title alloc/free bug in title.cpp. Matt Wells 2014-03-28 08:01:01 -07:00
  • c1671015c8 Merge branch 'diffbot-dan' into diffbot-testing Matt Wells 2014-03-27 12:19:50 -07:00
  • 582349334f do not use certain other json fields when computing checksum for deduping. like stats, querystring, ... Matt Wells 2014-03-27 12:20:53 -07:00
  • 402377d2e6 fix bug of gbmin, gbmax etc. not working. floats were being rounded down to ints in most cases it seems. so .9 -> 0 etc. Matt Wells 2014-03-26 11:56:06 -07:00
  • d67f09feeb also include a timestamp field with an RFC 1123 formatted date Daniel Steinberg 2014-03-25 21:45:21 -07:00
  • 0efac8c156 Defect #2080: seed URLs duplicated Daniel Steinberg 2014-03-25 17:25:55 -07:00
  • e1b1b15a38 bigger buffer Daniel Steinberg 2014-03-25 16:34:40 -07:00
  • 9846061dff when restarting a bulk job, copy bulkurls.txt to /tmp, and then transfer it back to the new collection folder Daniel Steinberg 2014-03-25 16:20:24 -07:00
  • ab90c06d8d add TODO for regex checking Daniel Steinberg 2014-03-25 13:05:43 -07:00
  • 1ff6c1fae0 Merge remote-tracking branch 'origin/diffbot' into diffbot-dan Daniel Steinberg 2014-03-25 12:53:37 -07:00
  • b8836745f0 use SpiderRequest instead of isonsamedomain flag to determine whether to output data in CSV (Defect #2122) Daniel Steinberg 2014-03-25 12:51:08 -07:00
  • b6e5424e32 do not download bulkjob urls in crawlbot. just return a fake http reply. however, do use crawl-delay throttling logic. deduping is already turned off for bulk jobs so it should be ok. mwells 2014-03-21 12:40:38 -07:00
  • 502752aba4 doc updates mwells 2014-03-21 08:59:13 -07:00
  • b33121af7d make all field names lower case without spaces when we hash them to make the prefixhash. since json names often have mixed case field names and spaces. Matt Wells 2014-03-20 16:08:02 -07:00
  • 98a10d4936 Merge branch 'testing' into diffbot-testing Matt Wells 2014-03-20 15:50:49 -07:00
  • bbc8fc0c79 always show admin link Matt Wells 2014-03-20 15:48:51 -07:00
  • 67202f3731 Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-03-20 15:39:03 -07:00
  • 99bd9319fd temp hack to reduce network comm between trinity and neo Matt Wells 2014-03-20 15:42:34 -07:00
  • 5ed19026d9 temp debug comments Matt Wells 2014-03-20 15:33:37 -07:00
  • b8d0e95035 Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-03-20 10:26:55 -07:00
  • ca0843aa8b more bool query fixes. mwells 2014-03-20 10:03:25 -07:00
  • cfbec626e8 more righteous fixes for bool queries mwells 2014-03-19 13:51:32 -07:00
  • ab3368b5a0 more bool fixes. not operator support. mwells 2014-03-19 09:38:45 -07:00
  • 1bb91149d6 more bool fixes mwells 2014-03-18 14:42:50 -07:00
  • 652892dc10 more bool fixes mwells 2014-03-18 14:37:59 -07:00
  • f392826b1e nested bool query fixes mwells 2014-03-18 14:08:59 -07:00
  • b7d80fd02d more bool query fixes mwells 2014-03-18 13:41:36 -07:00
  • b31eaee9fd simple bool queries work mwells 2014-03-18 12:07:29 -07:00
  • d4302e3301 fix core Matt Wells 2014-03-18 11:12:50 -07:00
  • 3b97682cc3 more bool query fixes Matt Wells 2014-03-18 10:44:56 -07:00
  • 6e23d37e47 Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-03-17 17:27:28 -07:00
  • 54cc8088fb more bool query fixes. hopefully this will do it, but still can do some optimizations for speed. mwells 2014-03-17 17:00:08 -07:00
  • 9d3c35ad17 nothing Matt Wells 2014-03-17 13:53:19 -07:00
  • 4abf56a75d cleanups Matt Wells 2014-03-16 18:06:22 -07:00
  • d2511d0bef host table cleanups Matt Wells 2014-03-16 17:14:47 -07:00
  • 5057fdaf14 aesthetic cleanups Matt Wells 2014-03-16 17:12:04 -07:00
  • d320bf9d75 spidering back on in main's coll.conf Matt Wells 2014-03-16 15:06:39 -07:00
  • c513ad9418 Merge branch 'diffbot' into testing Matt Wells 2014-03-16 14:51:22 -07:00
  • acd05aa740 fix a few minor bugs. /master/->/admin/ and crawl type mismatch. Matt Wells 2014-03-16 10:34:58 -07:00
  • edbd61b0c5 thread fixes. if pthread_create fails then keep thread queue and just return. will try to relaunch later. do not count delete keys towards shard rebalance count. Matt Wells 2014-03-15 20:07:02 -07:00
  • 5ca411e3e2 tuning the rebalance loop Matt Wells 2014-03-15 14:56:11 -07:00
  • 86147fe22c tight merge during rebalance to save disk space, so neg recs annihilate pos recs. Matt Wells 2014-03-14 23:37:30 -07:00
  • 6c704f6fdf Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-03-14 22:16:40 -07:00
  • e37eebd76f when rebalancing wait for merge to complete before scanning more Matt Wells 2014-03-14 22:16:25 -07:00
  • 82ac3fab6c merge fixes Matt Wells 2014-03-14 22:15:08 -07:00
  • df46a6fc1d Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot-matt Matt Wells 2014-03-14 19:32:10 -07:00
  • 1f162ce7b2 update localhosts.conf too Matt Wells 2014-03-14 19:20:23 -07:00
  • 553aefdb55 keep files tightly merged when doing rebalance to avoid running out of disk space Matt Wells 2014-03-14 19:19:41 -07:00
  • cb483c42ea more fixes for bool searching before using a slightly different and simpler approach mwells 2014-03-13 16:00:23 -07:00
  • 7812f5c746 more bool fixes. still needs a little more work mwells 2014-03-13 13:54:23 -07:00
  • 3b2d981dff more fixes for new boolean logic. mwells 2014-03-13 13:09:33 -07:00
  • fb0123ad53 nothing Matt Wells 2014-03-13 11:27:28 -07:00
  • 9acb7ef0f4 fix core &token= core Matt Wells 2014-03-13 07:57:06 -07:00
  • 7b5816f194 updated error message Daniel Steinberg 2014-03-12 20:56:27 -07:00
  • 018258bcaa Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-03-12 20:55:21 -07:00
  • fbd1bcd349 initial attempt at new boolean query logic. supports unlimited # of boolean query terms. already docid phased from phasing logic already there but could be phased more to save more mem and speed up a little more. Matt Wells 2014-03-12 20:53:44 -07:00
  • 3e7243c6ce fix add url core Matt Wells 2014-03-12 08:28:42 -07:00
  • 34f7540160 fix addurl core Matt Wells 2014-03-12 08:11:48 -07:00
  • 7ec1513d41 updates Matt Wells 2014-03-12 08:09:45 -07:00
  • f27d549fc6 Defect #2122: If a crawl and there are no urlCrawlPattern or urlCrawlRegEx values, only return URLs from that domain Daniel Steinberg 2014-03-11 19:46:38 -07:00
  • 85a5954256 only apply Defect #2099 updates if it's a bulk job. I didn't see that variable yesterday Daniel Steinberg 2014-03-11 18:52:14 -07:00
  • c81bbf6934 more informative error message Daniel Steinberg 2014-03-11 18:10:21 -07:00
  • b5be2dcf74 Merge branch 'diffbot-dan' of https://github.com/gigablast/open-source-search-engine into diffbot-dan Daniel Steinberg 2014-03-11 18:09:28 -07:00
  • 14c1b2efa3 more informative error message Daniel Steinberg 2014-03-11 18:06:42 -07:00
  • 312438a32b Merge branch 'diffbot-dan' into diffbot-testing Matt Wells 2014-03-11 17:02:59 -07:00
  • 84784d8d76 minor fixups Matt Wells 2014-03-11 17:02:24 -07:00
  • 2331b4673d Defect #2099: throw an error if a crawl request was made with a name that already existed for a bulk request (or the other way around) Daniel Steinberg 2014-03-11 16:21:58 -07:00
  • 8445e53c61 fix query reindex some more Matt Wells 2014-03-11 14:46:49 -07:00
  • c4b38a5c72 fix a few cores from previous code updates Matt Wells 2014-03-11 09:36:33 -07:00
  • 5c2e78e5fa Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-03-10 20:26:30 -07:00
  • 483f3c5bae fix core Matt Wells 2014-03-10 18:17:28 -07:00
  • f9fdc96563 no use in newline separating the list of urls if they're going to be read back in and need to be space separated Daniel Steinberg 2014-03-10 15:22:43 -07:00
  • e293d465a3 snprintf instead of sprintf Daniel Steinberg 2014-03-10 14:03:28 -07:00
  • 41e3988fbc not a conf file Daniel Steinberg 2014-03-10 13:57:13 -07:00
  • 4a7bf5d4d0 Story #2040: store raw URL submissions for customer bulk jobs Daniel Steinberg 2014-03-10 13:50:30 -07:00
  • bfcb7082f4 fix bug from nuking doledb on a new collection. Matt Wells 2014-03-10 13:48:00 -07:00
  • bd4484db3c Merge branch 'testing' into diffbot-testing Matt Wells 2014-03-10 12:08:23 -07:00
  • 9debee20dc Merge branch 'diffbot' into testing Matt Wells 2014-03-09 20:44:09 -07:00
  • 662b6d4b32 doc updates Matt Wells 2014-03-09 20:43:49 -07:00
  • 90ff2c2a25 update example site lists Matt Wells 2014-03-09 20:35:45 -07:00
  • 82db7240a3 simple print update Matt Wells 2014-03-09 19:43:32 -07:00
  • f7b7274ff1 replace "exact:" directive with "seed:" really the same thing. Matt Wells 2014-03-09 19:35:20 -07:00
  • f8e561e6f4 more new site list api fixes Matt Wells 2014-03-09 18:15:57 -07:00
  • 11e8c16878 new site list updates Matt Wells 2014-03-09 17:53:24 -07:00
  • ed626b162a more site list based spider fixes to be more like gsa Matt Wells 2014-03-08 20:52:31 -07:00
  • aab165ed20 fix bad return value from function Matt Wells 2014-03-08 19:32:56 -08:00
  • 4cb66c31bf get this new api spidering Matt Wells 2014-03-08 12:02:20 -07:00
  • 624c1d4e68 nuke doledb fixes Matt Wells 2014-03-08 10:51:15 -07:00